I have a problem where I look at a row of elements, and if there are no non-zero elements in the row, I want to set one to a random value. My difficulty is the update strategy. I just attempted a version using slice from Clatrix, but that does not work with Incanter matrices. So should I construct another matrix of "modifications" and perform element-wise addition? Or build some other form of matrix and perform a multiplication (I am not sure how to build that one yet)? Or maybe I could somehow map over the rows? My trouble there is that assoc does not work on Incanter matrices, which is apparently what sel returns.
It seems there is no easy way to do this in Incanter, so I ended up using set from Clatrix directly and avoiding Incanter for this operation.
ND4J INDArray slicing is achieved through one of the overloaded get() methods, as answered in java - Get an arbitrary slice of a Nd4j array - Stack Overflow. Since an INDArray occupies a contiguous block of native memory, does slicing with get() make a copy of the original memory? Row slicing in particular could return a new INDArray backed by the same memory.
I have also found another INDArray method, subArray(). Does that one behave differently?
I am asking because I am trying to create a DatasetIterator that extracts data directly from INDArrays, and I want to eliminate any avoidable overhead. There is too much abstraction in the source code, and I could not find the implementation myself.
A similar question about NumPy is asked in python - Numpy: views vs copy by slicing - Stack Overflow, and the answer can be found in Indexing — NumPy v1.16 Manual:
The rule of thumb here can be: in the context of lvalue indexing (i.e. the indices are placed in the left hand side value of an assignment), no view or copy of the array is created (because there is no need to). However, with regular values, the above rules for creating views does apply.
The short answer is: no, it uses a reference (a view) when possible. To make a copy, call the .dup() method.
To quote https://deeplearning4j.org/docs/latest/nd4j-overview
Views: When Two or More NDArrays Refer to the Same Data
A key concept in ND4J is the fact that two NDArrays can actually point to the same underlying data in memory. Usually, we have one NDArray referring to some subset of another array, and this only occurs for certain operations (such as INDArray.get(), INDArray.transpose(), INDArray.getRow() etc.). This is a powerful concept, and one that is worth understanding.
There are two primary motivations for this:
There are considerable performance benefits, most notably in avoiding copying arrays.
We gain a lot of power in terms of how we can perform operations on our NDArrays.
Consider a simple operation like a matrix transpose on a large (10,000 x 10,000) matrix. Using views, we can perform this matrix transpose in constant time without performing any copies (i.e., O(1) in big O notation), avoiding the considerable cost of copying all of the array elements. Of course, sometimes we do want to make a copy - at which point we can use INDArray.dup() to get a copy. For example, to get a copy of a transposed matrix, use INDArray out = myMatrix.transpose().dup(). After this dup() call, there will be no link between the original array myMatrix and the array out (thus, changes to one will not impact the other).
There is a particular class of algorithmic coding problems that requires us to evaluate many queries, which can be of two kinds:
Perform search over a range of data
Update the data over a given range
One example I have recently been working on (though not the only one) is this: Quadrant Queries
Now, to optimize my algorithm, I had one idea:
I can use dynamic programming to keep the search results for a particular range and derive the results for other ranges as required.
For example, to calculate the sum of the array elements from index 4 to 7, I can keep the sum of elements up to index 4 and the sum up to index 7; then I just need the difference of the two plus the 4th element, which is O(1). But this raises another problem: during an update operation, I would have to update the stored sums for every element after the updated one. That seems inefficient, though I have not measured it.
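The prefix-sum idea described above can be sketched in a few lines of plain Ruby (the variable names are illustrative only):

```ruby
# Prefix sums: O(n) preprocessing, then O(1) range-sum queries.
# prefix[i] holds the sum of a[0..i] (inclusive).
a = [3, 1, 4, 1, 5, 9, 2, 6]

prefix = []
sum = 0
a.each { |v| sum += v; prefix << sum }

# Sum of a[4..7] (inclusive) = prefix[7] - prefix[3]
range_sum = prefix[7] - prefix[3]
# => 22  (5 + 9 + 2 + 6)
```

The catch is exactly the one raised in the question: a single update to a[i] invalidates every prefix entry from i onward, so updates cost O(n).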
Someone suggested to me that subsequent update operations can be combined using some special data structure (I actually read this on a forum).
Question: Is there a known way to optimize this kind of problem? Is there a special data structure that does it? And the idea I mentioned: might it be more efficient than the direct approach? Should I try it out?
It might help:
Segment trees (in particular the range-update, range-query variant)
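A close relative of the segment tree, the Fenwick tree (binary indexed tree), handles the point-update / range-sum case from the question in O(log n) per operation. A minimal Ruby sketch (class name is illustrative):

```ruby
# Fenwick tree (binary indexed tree): point updates and prefix-sum
# queries in O(log n) each, fixing the O(n)-update weakness of a
# plain prefix-sum array.
class FenwickTree
  def initialize(n)
    @tree = Array.new(n + 1, 0) # 1-based internal indexing
  end

  # Add delta to the element at 0-based index i.
  def update(i, delta)
    i += 1
    while i < @tree.size
      @tree[i] += delta
      i += i & -i # jump to the next node responsible for index i
    end
  end

  # Sum of elements at 0-based indices 0..i.
  def prefix_sum(i)
    i += 1
    s = 0
    while i > 0
      s += @tree[i]
      i -= i & -i # strip the lowest set bit
    end
    s
  end

  # Sum over the inclusive 0-based range lo..hi.
  def range_sum(lo, hi)
    prefix_sum(hi) - (lo.zero? ? 0 : prefix_sum(lo - 1))
  end
end

ft = FenwickTree.new(8)
[3, 1, 4, 1, 5, 9, 2, 6].each_with_index { |v, i| ft.update(i, v) }
ft.range_sum(4, 7) # => 22
ft.update(4, 10)   # a[4] becomes 15
ft.range_sum(4, 7) # => 32
```

For range updates combined with range queries, a segment tree with lazy propagation is the usual choice; the Fenwick tree above covers the simpler point-update case.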
I have a set of samples (vectors), each of dimension about M (10000), and the size of the set is also about N (10000). I want to find the first 10 PCs (those with the biggest eigenvalues) of this set. Due to the large dimension of the samples, I cannot calculate the covariance matrix in reasonable time. Are there methods to find PCs without calculating the full covariance matrix, or methods that can effectively handle high-dimensional data? Such a method should require fewer than O(M*M*N) operations.
NIPALS -- Non-linear iterative partial least squares
see for example here: http://en.wikipedia.org/wiki/NIPALS
In case it helps someone: I have found a solution in the family of EM-PCA methods (see, for example, http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/papers/PCA_RoweisEMPCA.pdf).
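The core trick these methods share can be shown with plain power iteration: never form the M x M covariance matrix C, but apply it implicitly as C*v = X^T (X v) / N, which costs O(M*N) per iteration. A minimal sketch in plain Ruby (the method name top_pc is hypothetical, and the data is assumed to be mean-centered):

```ruby
# Power iteration for the top principal component without forming the
# covariance matrix. (X^T X) v is computed as X^T (X v): O(M*N) per
# iteration instead of O(M*M*N) to build the covariance explicitly.
# data: N rows (samples) x M columns (features), mean-centered.
def top_pc(data, iters = 100)
  m = data.first.size
  v = Array.new(m) { rand - 0.5 } # random starting direction

  iters.times do
    # w = X v  (length N)
    w = data.map { |row| row.zip(v).sum { |a, b| a * b } }
    # u = X^T w  (length M), i.e. (X^T X) v
    u = Array.new(m, 0.0)
    data.each_with_index do |row, i|
      row.each_with_index { |x, j| u[j] += x * w[i] }
    end
    # normalize to keep the iterate from over/underflowing
    norm = Math.sqrt(u.sum { |x| x * x })
    v = u.map { |x| x / norm }
  end
  v
end

# Toy mean-centered data where the first coordinate dominates:
data = [[10.0, 0.1], [-10.0, -0.1], [9.0, 0.2], [-9.0, -0.2]]
v = top_pc(data)
# v[0].abs is close to 1: the first coordinate carries nearly all variance
```

For further components, deflate (subtract each sample's projection onto the found component) and repeat; the EM-PCA paper cited above generalizes this to extract several components at once.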
From birth I've always been taught to avoid nested arrays like the plague for performance and internal data structure reasons. So I'm trying to find a good solution for optimized multidimensional data structures in Ruby.
The typical solution would involve using a 1-D array and accessing each element by x * width + y.
Ruby has the ability to overload the [] operator, so perhaps a good solution would involve using multi_dimensional_array[2, 4], or even a splat to support an arbitrary number of dimensions. (But really, I only need two dimensions.)
Is there a library/gem already out there for this? If not, what would the best way to go about writing this be?
My nested-array-lookups are the bottleneck right now of my rather computationally-intensive script, so this is something that is important and not a case of premature optimization.
If it helps, my script uses mostly random lookups and less traversals.
narray
NArray is a Numerical N-dimensional Array class. Supported element types are 1/2/4-byte Integer, single/double-precision Real/Complex, and Ruby Object. This extension library incorporates fast calculation and easy manipulation of large numerical arrays into the Ruby language. NArray has features similar to NumPy, but NArray has vector and matrix subclasses.
You could inherit from Array and create your own class that emulated a multi-dimensional array (but was internally a simple 1-dimensional array). You may see some speedup from it, but it's hard to say without writing the code both ways and profiling it.
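A minimal sketch of that approach (a hypothetical two-dimensional class over flat storage, indexed as row * width + column; wrapping an Array is shown here, though inheriting works too):

```ruby
# 2-D grid stored in a flat Ruby Array; [] and []= are overloaded so
# callers write grid[x, y] while the storage stays one-dimensional.
class Grid2D
  def initialize(width, height, default = nil)
    @width = width
    @store = Array.new(width * height, default)
  end

  # x is the row index, y the column; flat index = x * width + y
  def [](x, y)
    @store[x * @width + y]
  end

  def []=(x, y, value)
    @store[x * @width + y] = value
  end
end

grid = Grid2D.new(4, 3, 0) # 4 columns per row, 3 rows, filled with 0
grid[2, 1] = 42
grid[2, 1] # => 42
```

Whether this beats nested arrays in practice depends on the Ruby implementation; as the answer says, profile both before committing.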
You may also want to experiment with the NArray class.
All that aside, your nested-array lookups might not be the real bottleneck they appear to be. On several occasions, I have had the same problem and later found that rewriting some of my logic cleared up the bottleneck. It's not just about speeding up the nested lookups; it's about minimizing the number of lookups needed. Each "random access" in an n-dimensional array takes n lookups (one per nested array level). You can reduce this by iterating through the dimensions using code like:
array.each do |x|
  x.each do |y|
    y.each do |z|
      # ...
    end
  end
end
This allows you to do a single lookup in the first dimension and then access everything "behind" it, then a single lookup in the second dimension, etc etc. This will result in significantly fewer lookups than randomly accessing elements.
If you need random element access, you may want to try using a hash instead. You can take the array indices, concatenate them together as a string, and use that as the hash key (for example, array[12][0][3] becomes hash['0012_0000_0003']). This may result in faster "random access" times, but you'd want to profile it to know for certain.
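That string-key scheme can be sketched as follows (the key helper is hypothetical; note that in Ruby you could also use the index array [x, y, z] itself as the hash key):

```ruby
# Sparse n-dimensional storage in a Hash; the coordinates are packed
# into a single zero-padded string key, as described above.
sparse = Hash.new(0) # unset cells read back as 0

key = ->(x, y, z) { format('%04d_%04d_%04d', x, y, z) }

sparse[key.(12, 0, 3)] = 99
sparse[key.(12, 0, 3)] # => 99
sparse[key.(5, 5, 5)]  # => 0 (default for unset cells)
```

As the answer notes, whether this beats nested-array access is workload-dependent, so profile it.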
Any chance you can post some of your problematic code? Knowing the problem code will make it easier for us to recommend a solution.
Nested arrays aren't that bad if you traverse them properly: first traverse rows, then traverse columns. That should be quite fast. If you need a certain element often, store its value in a variable; otherwise you are jumping around in memory, which leads to bad performance.
Big rule: don't jump around in your nested array; traverse it linearly, row by row.
I would like to quickly retrieve the median value from a Boost multi_index container with an ordered_unique index; however, the index iterators aren't random access (I don't understand why they can't be, though this is consistent with std::set...).
Is there a faster/neater way to do this other than incrementing an iterator container.size() / 2 times?
Boost.MultiIndex provides random-access indices, but these indices do not maintain any ordering themselves. You can, however, sort such an index with its sort member function after inserting a new element, and then retrieve the median efficiently.
It might be worth filing a feature request with Boost.MultiIndex so that insertion into a random-access index can maintain an order directly, as that should be much more efficient than re-sorting.
I ran into the same problem in a different context. It seems that the STL and Boost don't provide an ordered container that has random access to make use of the ordering (e.g. for comparing).
My (not so pretty) solution was a class that performed the input and "filtered" it through a set. After the input operation finished, it copied all of the set's iterators into a vector and used that for random access.
This solution only works in a very limited context: you fill the container once. If you add to the container again, all the iterators have to be copied again. It was clumsy to use, but it worked.