fastest way to create 2d array - performance

I have been testing the fastest ways to create and initialize a 2D matrix in Rust.
Methods I tried (initializing a 1000x1000 2D array) and the time each took:

Array2::<i32>::zeros((width, height)), using the ndarray crate: 47µs
vec![vec![0; height]; width]: 8.0649ms
[[u32; width]; height] = [[0; width]; height]: 301.4µs
How can the ndarray crate initialize a 2D array so fast? Can we achieve the same speed using primitive types?

Because under the hood it allocates a single one-dimensional vector. Try this instead:
vec![0; width * height]
A zero-initialized vector of a numeric type can be created very quickly because no per-element initialization is necessary: the allocator can hand back memory that the operating system already zeroed. Your second method, however, allocates width + 1 vectors and requires initialization of the outer container. This is not only needlessly wasteful in terms of memory consumption, but also likely to cause poor data locality and heap fragmentation, since nothing requires the heap allocations for the inner vectors to be near each other. The performance hit here is almost certainly due to performing 1,000 extra heap allocations.
The third method creates an array of arrays. This lays out the data contiguously, so it is better than the second method, but with this approach Rust can't skip the initialization step the way it can when allocating zeroed memory from the heap, so it takes a bit longer.
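For illustration, here is the same flat-buffer idea sketched in C++ (Matrix2D and its at accessor are made-up names, not from any library): one contiguous, zero-initialized allocation, with element (row, col) found at index row * width + col.

#include <cstddef>
#include <vector>

// A "2D" matrix backed by a single contiguous allocation.
struct Matrix2D {
    std::size_t width, height;
    std::vector<int> data;

    Matrix2D(std::size_t w, std::size_t h)
        : width(w), height(h), data(w * h, 0) {}   // one zeroed allocation

    // Row-major indexing: element (row, col) lives at row * width + col.
    int& at(std::size_t row, std::size_t col) { return data[row * width + col]; }
};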

Related

Multiply Eigen::Matrices without transposing first

Say you have a Matrix<int, 4, 2> and a Matrix<int, 3, 2> which you want to multiply in the natural way that contracts the last (-1) dimension, without first transposing.
Is this possible? Or do we have to transpose first? That would be silly (unperformant) from a cache perspective, because then the elements we are multiplying and summing aren't contiguous.
Here's a playground: https://godbolt.org/z/Gdj3sfzcb
Pytorch provides torch.inner and torch.tensordot which do this.
Just like in Numpy, transpose() only creates a "view". It doesn't do any expensive memory operations (unless you assign the result to a new matrix). Just call a * b.transpose() and let Eigen handle the details of the memory access. A properly optimized BLAS-style library like Eigen handles the transposition on smaller tiles in temporary memory for optimal performance.
Memory order still matters for fine tuning though. If you can, write your matrix multiplications in the form a.transpose() * b for column-major matrices (like Eigen, Matlab), or a * b.transpose() for row-major matrices like those in Numpy. That saves the BLAS library the trouble of doing that transposition.
Side note: You used auto for your result. Please read the Common Pitfalls chapter in the documentation. Your code didn't compute a matrix multiplication; it stored an expression of one.
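A minimal sketch of the suggested call, using the 4x2 and 3x2 shapes from the question (variable names are made up):

#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::Matrix<int, 4, 2> a;
    Eigen::Matrix<int, 3, 2> b;
    a.setRandom();
    b.setRandom();

    // Contracts the last dimension of both operands: (4x2) * (2x3) = 4x3.
    // transpose() is a lazy view, so no copy of b is made here.
    Eigen::Matrix<int, 4, 3> c = a * b.transpose();  // concrete type, not auto

    std::cout << c << std::endl;
}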

C++ - need suggestion on 2-d Data structure - size of 1-D is not fixed

I wish to have a 2-D data structure to store SAT formulas. What I want is similar to a 2-D array, but with dynamic memory allocation.
I was thinking of an array of vectors, of the following form:
typedef vector<int> intVector;
intVector *clause;
clause = new intVector[numClause + 1];
Thus clause[0] will be one vector, clause[1] will be another and so on. And each vector may have a different size. But I am unsure if this is the right thing to do in terms of memory allocation.
Can an array of vectors be made so that the vectors have different sizes? How bad is it in terms of memory management?
Thanks.
Memory management will be fine if you use the STL (it handles the dynamic allocation for you), but performance will drop somewhat, because the storage is dynamic and the inner level is reached through an extra indirection rather than direct access.
If you're truly desperate, you could use a vector of pairs, with the first element a fixed-size array and the second the count of used elements. If some array grows too big, you reallocate it with more memory and update the count. This will be a big mess, but it may work: you'll waste less memory and get direct access to the second level of arrays, but it's ugly and error-prone.
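To the direct question: yes, a vector of vectors where each inner vector has its own size works fine and avoids the raw new entirely. A minimal sketch (the clause contents are made up):

#include <iostream>
#include <vector>

int main() {
    // One inner vector per clause; each may have a different length.
    std::vector<std::vector<int>> clauses(3);

    clauses[0] = {1, -2};       // two literals
    clauses[1] = {3};           // one literal
    clauses[2] = {-1, 2, -3};   // three literals

    for (const auto& c : clauses)
        std::cout << c.size() << " literals\n";
}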

Efficiency of appending to vectors

Appending an element onto a million-element ArrayList has the cost of setting one reference now, and copying one reference in the future when the ArrayList must be resized.
As I understand it, appending an element onto a million-element PersistentVector must create a new path, which consists of 4 arrays of size 32. This means more than 120 references have to be touched.
How does Clojure manage to keep the vector overhead to "2.5 times worse" or "4 times worse" (as opposed to "60 times worse"), which has been claimed in several Clojure videos I have seen recently? Has it something to do with caching or locality of reference or something I am not aware of?
Or is it somehow possible to build a vector internally with mutation and then turn it immutable before revealing it to the outside world?
I have tagged the question scala as well, since scala.collection.immutable.Vector is basically the same thing, right?
Clojure's PersistentVectors have a special tail buffer to enable efficient operations at the end of the vector. Only after this 32-element array is filled is it added to the rest of the tree. This keeps the amortized cost low. Here is one article on the implementation. The source is also worth a read.
Regarding "is it somehow possible to build a vector internally with mutation and then turn it immutable before revealing it to the outside world?": yes! These are known as transients in Clojure, and they are used for efficient batch changes.
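The build-mutably-then-freeze idea generalizes beyond Clojure. As a rough C++ analogy (illustrative only; this is not how transients are implemented), build with cheap in-place mutation and hand out only a const view:

#include <cstddef>
#include <memory>
#include <vector>

// Callers receive a const view, so they can never observe the mutation
// that happened during construction.
std::shared_ptr<const std::vector<int>> build_frozen(std::size_t n) {
    auto v = std::make_shared<std::vector<int>>();
    v->reserve(n);                          // one allocation up front
    for (std::size_t i = 0; i < n; ++i)
        v->push_back(static_cast<int>(i));  // cheap in-place appends
    return v;                               // implicitly converts to const
}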
I can't speak for Clojure, but I can give some comments about Scala Vectors.
Persistent Scala vectors (scala.collection.immutable.Vector) are much slower than an array buffer when it comes to appending. In fact, they are 10x slower than the List prepend operation, and 2x slower than appending to Conc-trees, which we use in Parallel Collections.
But Scala also has mutable vectors -- they're hidden in the class VectorBuilder. Appending to a mutable vector does not preserve the previous version of the vector, but mutates it in place by keeping a pointer to the rightmost leaf in the vector. So, yes -- keeping the vector mutable internally and then returning an immutable reference is exactly what's done in the Scala collections.
The VectorBuilder is slightly faster than the ArrayBuffer, because it needs to allocate its arrays only once, whereas ArrayBuffer needs to do it twice on average (because of growing). Conc.Buffers, which we use as parallel array combiners, are twice as fast compared to VectorBuilders.
Benchmarks are here. None of the benchmarks involve any boxing; they work with reference objects to avoid any bias:
comparison of Scala List, Vector and Conc
comparison of Scala ArrayBuffer, VectorBuilder and Conc.Buffer
More collections benchmarks here.
These tests were executed using ScalaMeter.

How to use fixed matrix in Eigen?

I have a big matrix, 1000x500. How can I use an Eigen fixed-size matrix to speed things up? The dynamic matrix is slow.
Using fixed-size matrices for matrices this large is pointless. Recall that the advantages of fixed-size matrices are (1) allocation on the stack, if requested, and (2) explicit loop unrolling. Neither helps at 1000x500: a matrix that large would likely overflow the stack, and fully unrolling it makes no sense.
If you think the computation you are performing is too slow, then be specific about your computation. Moreover, make sure you benchmark with compiler optimisations ON. Because of the heavy use of templates, Eigen is particularly slow in debug mode.
Finally, for the record, here is how you can create a fixed-size matrix with arbitrary compile-time dimensions, e.g. a 6x8 matrix of doubles:
Matrix<double, 6, 8> mat;
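For contrast, a minimal sketch of both flavours (variable names are made up); for something as large as 1000x500, the dynamic type is the appropriate one:

#include <Eigen/Dense>

int main() {
    // Fixed-size: dimensions are compile-time template parameters and
    // storage can live on the stack. Only sensible for small matrices.
    Eigen::Matrix<double, 6, 8> small = Eigen::Matrix<double, 6, 8>::Zero();

    // Dynamic: dimensions chosen at run time, storage on the heap.
    Eigen::MatrixXd big = Eigen::MatrixXd::Zero(1000, 500);

    (void)small;
    (void)big;
}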

vector --> concurrent_vector migration + OpenGL restriction

I need to speed up a calculation whose result is then used to draw an OpenGL model.
The major speed-up came from changing std::vector to Concurrency::concurrent_vector and using parallel_for instead of plain for loops.
This vector (or concurrent_vector) is filled in a for (or parallel_for) loop and contains the vertices for OpenGL to visualize.
Using std::vector is fine because the OpenGL rendering code relies on the fact that std::vector keeps its items contiguous in memory, which is not the case with concurrent_vector. The code runs something like this:
glVertexPointer(3, GL_FLOAT, 0, &vectorWithVerticesData[0]);
Generating a concurrent_vector and copying it into a std::vector is too expensive, since there are a lot of items.
So, the question is: I'd like to use OpenGL vertex arrays, but I'd also like to use concurrent_vector, which is incompatible with OpenGL output.
Any suggestions?
You're trying to use a data structure that doesn't store its elements contiguously with an API that requires contiguous storage. One of those has to give, and it's not going to be OpenGL: GL isn't going to walk concurrent_vector's internal data structure (not if you care about performance).
So your only real option is to not use a non-contiguous container.
I can only guess at what you're doing (since you didn't provide example code for the generator), which limits what I can advise. If your parallel_for iterates a fixed number of times (by "fixed" I mean a value that is known immediately before parallel_for executes and doesn't change based on how many times you've iterated), then you can just use a regular vector.
Simply size the vector up front with vector::resize (or construct it with the element count). This value-initializes the elements, so every element exists. You can now run your parallel_for loop, but instead of using push_back or the like, you write each element directly into its slot in the output. I think parallel_for can iterate over the actual vector iterators, but I'm not positive. Either way, it doesn't matter; you won't get any race conditions unless you try to write the same element from different threads.
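A minimal sketch of that approach, assuming Microsoft's PPL (compute_vertex is a made-up stand-in for the real generator):

#include <cstddef>
#include <ppl.h>      // Concurrency::parallel_for (Microsoft PPL)
#include <vector>

struct Vertex { float x, y, z; };

// Hypothetical per-vertex computation standing in for the real one.
Vertex compute_vertex(std::size_t i) {
    return Vertex{ float(i), 0.5f * float(i), 0.0f };
}

int main() {
    const std::size_t count = 1000000;     // known before the loop starts

    std::vector<Vertex> vertices(count);   // value-initialized, contiguous

    // Each iteration writes only its own slot: no races, and no need
    // for concurrent_vector.
    Concurrency::parallel_for(std::size_t(0), count, [&](std::size_t i) {
        vertices[i] = compute_vertex(i);
    });

    // The buffer is contiguous, so it can be handed straight to OpenGL:
    // glVertexPointer(3, GL_FLOAT, 0, vertices.data());
}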
