Does ND4J slicing make a copy of the original array?

ND4J INDArray slicing is achieved through one of the overloaded get() methods, as answered in java - Get an arbitrary slice of a Nd4j array - Stack Overflow. Since an INDArray occupies a contiguous block of native memory, does slicing with get() make a copy of the original memory (especially row slicing, where it would be possible to create a new INDArray backed by the same memory)?
I have also found another INDArray method, subArray(). Does it behave any differently?
I am asking because I am trying to create a DatasetIterator that extracts data directly from INDArrays, and I want to eliminate any avoidable overhead. There is so much abstraction in the source code that I couldn't find the implementation myself.
A similar question about NumPy is asked in python - Numpy: views vs copy by slicing - Stack Overflow, and the answer can be found in Indexing — NumPy v1.16 Manual:
The rule of thumb here can be: in the context of lvalue indexing (i.e. the indices are placed on the left-hand side of an assignment), no view or copy of the array is created (because there is no need to). However, with regular values, the above rules for creating views do apply.

The short answer is: no, it uses a view of the same underlying data when possible. To make a copy, the .dup() method can be called.
To quote https://deeplearning4j.org/docs/latest/nd4j-overview:
Views: When Two or More NDArrays Refer to the Same Data
A key concept in ND4J is the fact that two NDArrays can actually point to the same underlying data in memory. Usually, we have one NDArray referring to some subset of another array, and this only occurs for certain operations (such as INDArray.get(), INDArray.transpose(), INDArray.getRow() etc.). This is a powerful concept, and one that is worth understanding.
There are two primary motivations for this:
There are considerable performance benefits, most notably in avoiding copying arrays.
We gain a lot of power in terms of how we can perform operations on our NDArrays.
Consider a simple operation like a matrix transpose on a large (10,000 x 10,000) matrix. Using views, we can perform this matrix transpose in constant time without performing any copies (i.e., O(1) in big O notation), avoiding the considerable cost of copying all of the array elements. Of course, sometimes we do want to make a copy - at which point we can use INDArray.dup() to get a copy. For example, to get a copy of a transposed matrix, use INDArray out = myMatrix.transpose().dup(). After this dup() call, there will be no link between the original array myMatrix and the array out (thus, changes to one will not impact the other).
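To make the view-vs-copy behaviour concrete, here is a minimal sketch (shapes and values are arbitrary, and the exact overloads can differ slightly between ND4J versions):

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class ViewVsCopy {
    public static void main(String[] args) {
        INDArray matrix = Nd4j.linspace(1, 12, 12).reshape(3, 4);

        INDArray rowView = matrix.getRow(1);       // a view backed by matrix's memory
        rowView.putScalar(0, 100.0);               // writes through to matrix
        System.out.println(matrix);                // element (1, 0) is now 100.0

        INDArray rowCopy = matrix.getRow(1).dup(); // an independent copy
        rowCopy.putScalar(1, -1.0);                // does not affect matrix
        System.out.println(matrix);
    }
}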

Related

best data structure for range delete

I have a stream of chars that I need to keep in a big data structure (can contain billions of chars)
I need to be able to:
store these chars quickly.
get all the chars quickly, in order to print them for example
Delete a range of chars without leaving any gaps in the memory.
My first thought was a doubly linked list, but the problem is that it takes too long to get to the middle of the list (the beginning of the range) in order to delete.
To solve that I was thinking about a skip list, which would make the search for this range faster, but then I'm facing the problem of having to re-index each node after deletion:
([0,1,2,3,4,5,6,7]
=> delete (3,4)
=> [0,1,2,5,6,7]
=> delete (3,4)
=> [0,1,2,7]
in this example, after the first delete I need to give the numbers 5, 6, 7 new indexes)
What is the best way to do this?
It might be helpful to read about the span<T> data structure.
Related Answer: What is a "span" and when should I use one?
A span<T> is:
A very lightweight abstraction of a contiguous sequence of values of type T somewhere in memory.
Basically a struct { T * ptr; std::size_t length; } with a bunch of convenience methods.
A non-owning type (i.e. a "reference-type" rather than a "value type"): It never allocates nor deallocates anything and does not keep
smart pointers alive.
I would add that if you are processing a stream of characters, you will probably want to use buffering (or, perhaps more aptly, "chunking"), where each chunk is itself a fixed-size span<char> (each stored in a separate piece of memory) but tracked in a central array (or a more complex data structure like a doubly linked list, to facilitate quick deletion).
It would be an anti-pattern to attempt to actually maintain your entire stream of data in a single piece of contiguous physical memory (which you seem to suggest in part 3 of your request) - especially if you plan on deleting chunks of it. There should be other ways to facilitate fast deletion without sacrificing performance elsewhere.
For example, if you wish to delete a range of characters that falls into a given span, you can create two new spans from the start and end of the original span, excluding the deleted characters, and then replace the original span instance in your larger data structure (e.g. if it were a doubly linked list) with the two new smaller spans. None of this requires copying the underlying data itself, just slicing up our lightweight references to the underlying data.
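If you end up rolling this yourself (for instance in Java, which has no built-in span), the non-owning view plus the split-on-delete operation described above can be sketched in a few lines - the type and method names here are hypothetical:

// Hypothetical CharSpan: a non-owning view over part of a shared char[].
final class CharSpan {
    final char[] data;  // shared backing storage (the span does not own it)
    final int offset;   // start of this span within data
    final int length;   // number of chars visible through this span

    CharSpan(char[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    char charAt(int i) { return data[offset + i]; }

    // "Deleting" the range [from, to) within this span yields two smaller spans
    // over the same backing array; no characters are copied or moved.
    CharSpan[] splitOutRange(int from, int to) {
        return new CharSpan[] {
            new CharSpan(data, offset, from),
            new CharSpan(data, offset + to, length - to)
        };
    }
}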
If your language of choice doesn't support span, or a similar structure, check out how span is implemented.
Depending on your language of choice, it may even have built-in support for streaming spans (as .NET Core 2.1+ (2018) does).
Any additional requirements (such as fast indexing to any point in your data stream, net of any deletions) can be satisfied by maintaining separate data structures that carry metadata about your spans (such as the suggested linked list). They will need updating when spans are deleted or added to, but because spans are a thin layer on top of large strings of characters, they reduce the cardinality of the data structures you are maintaining by several orders of magnitude. So while you could get fancy with maintaining a variety of heaps and maps to facilitate O(1) algorithms for every operation, you will probably find that basic structures and O(log(n)) or even O(N) (where N is effectively N/chunk-size) maintenance operations are feasible.

FastTextKeyedVectors difference between vectors, vectors_vocab and vectors_ngrams instance variables

I downloaded wiki-news-300d-1M-subword.bin.zip and loaded it as follows:
import gensim
print(gensim.__version__)
model = gensim.models.fasttext.load_facebook_model('./wiki-news-300d-1M-subword.bin')
print(type(model))
model_keyedvectors = model.wv
print(type(model_keyedvectors))
model_keyedvectors.save('./wiki-news-300d-1M-subword.keyedvectors')
As expected, I see the following output:
3.8.1
<class 'gensim.models.fasttext.FastText'>
<class 'gensim.models.keyedvectors.FastTextKeyedVectors'>
I also see the following three numpy arrays serialized to the disk:
$ du -h wiki-news-300d-1M-subword.keyedvectors*
127M wiki-news-300d-1M-subword.keyedvectors
2.3G wiki-news-300d-1M-subword.keyedvectors.vectors_ngrams.npy
2.3G wiki-news-300d-1M-subword.keyedvectors.vectors.npy
2.3G wiki-news-300d-1M-subword.keyedvectors.vectors_vocab.npy
I understand vectors_vocab.npy and vectors_ngrams.npy; however, what is vectors.npy used for internally in gensim.models.keyedvectors.FastTextKeyedVectors? If I look at the source code for looking up a word vector, I do not see the attribute vectors being used anywhere. I see the attributes vectors_vocab and vectors_ngrams being used. However, if I remove the vectors.npy file, I am not able to load the model using the gensim.models.keyedvectors.FastTextKeyedVectors.load method.
Can someone please explain where this variable is used? Can I remove it if all I am interested in is looking up word vectors (to reduce the memory footprint)?
Thanks.
vectors_ngrams are the buckets storing the vectors that are learned from word-fragments (character-n-grams). It's a fixed size no matter how many n-grams are encountered - as multiple n-grams can 'collide' into the same slot.
vectors_vocab are the full-word-token vectors as trained by the FastText algorithm, for full-words of interest. However, note that the actual word-vector, as returned by FastText for an in-vocabulary word, is defined as being this vector plus all the subword vectors.
vectors stores the actual, returnable full-word vectors for in-vocabulary words. That is: it's the precalculated combination of the vectors_vocab value plus all the word's n-gram vectors.
So, vectors is never directly trained, and can always be recalculated from the other arrays. It probably should not be stored as part of the saved model (as it's redundant info that could be reconstructed on demand).
(It could possibly even be made an optional optimization, for the specific case of FastText – with users who are willing to save memory, but have slower per-word lookup, discarding it. However, this would complicate the very common and important most_similar()-like operations, which are far more efficient if they have a full, ready array of all potential-answer word-vectors.)
If you don't see vectors being directly accessed, perhaps you're not considering methods inherited from superclasses.
While any model that was saved with vectors present will need that file when later .load()ed, you could conceivably save on disk-storage by discarding the model.wv.vectors property before saving, then forcing its reconstruction after loading. You would still be paying the RAM cost, when the model is loaded.
After vectors is calculated, and if you're completely done training, you could conceivably discard the vectors_vocab property to save RAM. (For any known word, the vectors can be consulted directly for instant look-up, and vectors_vocab is only needed in the case of further training or needing to re-generate vectors.)

Why are we using linked list to address collisions in hash tables?

I was wondering why many languages (Java, C++, Python, Perl etc) implement hash tables using linked lists to avoid collisions instead of arrays?
I mean instead of buckets of linked lists, we should use arrays.
If the concern is about the size of the array then that means that we have too many collisions so we already have a problem with the hash function and not the way we address collisions. Am I misunderstanding something?
I mean instead of buckets of linked lists, we should use arrays.
Pros and cons to everything, depending on many factors.
The two biggest problems with arrays:
changing capacity involves copying all content to another memory area
you have to choose between:
a) arrays of Element*s, adding one extra indirection during table operations, and one extra memory allocation per non-empty bucket with associated heap management overheads
b) arrays of Elements, such that pre-existing Element iterators/pointers/references are invalidated by some operations on other nodes (e.g. insert) (the linked list approach - or 2a above, for that matter - needn't invalidate these)
...will ignore several smaller design choices about indirection with arrays...
Practical ways to reduce copying from 1. include keeping excess capacity (i.e. currently unused memory for anticipated or already-erased elements), and - if sizeof(Element) is much greater than sizeof(Element*) - you're pushed towards arrays-of-Element*s (with "2a" problems) rather than Element[]s/2b.
There are a couple other answers claiming erasing in arrays is more expensive than for linked lists, but the opposite's often true: searching contiguous Elements is faster than scanning a linked list (fewer steps in code, more cache friendly), and once found you can copy the last array Element or Element* over the one being erased, then decrement the size.
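As a small illustration of that erase strategy for an unsorted array-backed bucket (sketched in Java for brevity; the C++ version is analogous):

final class Bucket {
    // Erase by overwriting the erased slot with the last element ("swap and pop").
    // Order within the bucket is not preserved, which is fine for a hash bucket.
    static int eraseAt(int[] bucket, int size, int index) {
        bucket[index] = bucket[size - 1];
        return size - 1; // new logical size
    }
}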
If the concern is about the size of the array then that means that we have too many collisions so we already have a problem with the hash function and not the way we address collisions. Am I misunderstanding something?
To answer that, let's look at what happens with a great hash function. Packing a million elements into a million buckets using a cryptographic strength hash, a few runs of my program counting the number of buckets to which 0, 1, 2 etc. elements hashed yielded...
0=367790 1=367843 2=184192 3=61200 4=15370 5=3035 6=486 7=71 8=11 9=2
0=367664 1=367788 2=184377 3=61424 4=15231 5=2933 6=497 7=75 8=10 10=1
0=367717 1=368151 2=183837 3=61328 4=15300 5=3104 6=486 7=64 8=10 9=3
If we increase that to 100 million elements - still with load factor 1.0:
0=36787653 1=36788486 2=18394273 3=6130573 4=1532728 5=306937 6=51005 7=7264 8=968 9=101 10=11 11=1
We can see the ratios are pretty stable. Even with load factor 1.0 (the default maximum for C++'s unordered_set and -map), 36.8% of buckets can be expected to be empty, another 36.8% handling one Element, 18.4% 2 Elements and so on. For any given array resizing logic you can easily get a sense of how often it will need to resize (and potentially copy elements). You're right that it doesn't look bad, and may be better than linked lists if you're doing lots of lookups or iterations, for this idealistic cryptographic-hash case.
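Those fractions are just a Poisson distribution with mean 1 (P(k) = e^-1 / k!, so about 36.8% empty, 36.8% with one element, 18.4% with two, and so on). A quick sketch to reproduce the expected ratios:

public class BucketOccupancy {
    public static void main(String[] args) {
        // Expected fraction of buckets holding k elements at load factor 1.0,
        // assuming a uniformly distributing hash: Poisson with lambda = 1.
        double p = Math.exp(-1);                     // P(0) = 1/e
        for (int k = 0; k <= 8; k++) {
            System.out.printf("%d elements: %.4f%n", k, p);
            p /= (k + 1);                            // P(k+1) = P(k) / (k+1) when lambda = 1
        }
    }
}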
But good-quality hashing is relatively expensive in CPU time, so the hash functions supplied with general-purpose hash tables are often very weak: e.g. it's very common for C++ Standard Library implementations of std::hash<int> to return their argument, and MS Visual C++'s std::hash<std::string> picks 10 characters evenly spaced along the string to incorporate in the hash value, regardless of how long the string is.
Clearly, implementers' experience has been that this combination of weak-but-fast hash functions and linked lists (or trees) to handle the greater collision proneness works out faster on average - and has fewer user-antagonising manifestations of obnoxiously bad performance - for everyday keys and requirements.
Strategy 1
Use (small) arrays which get instantiated and subsequently filled once collisions occur. 1 heap operation for the allocation of the array, then room for N-1 more entries. If no collision ever occurs again for that bucket, N-1 entries' worth of capacity is wasted. The list wins if collisions are rare: no excess memory is allocated just for the possibility of more overflows on a bucket. Removing items is also more expensive: either mark deleted spots in the array or move the stuff behind them to the front. And what if the array is full? A linked list of arrays, or resize the array?
One potential benefit of using arrays would be to do a sorted insert and then a binary search upon retrieval. The linked list approach cannot compete with that. But whether or not that pays off depends on the write/retrieve ratio. The less frequently writing occurs, the more this could pay off.
Strategy 2
Use lists. You pay for what you get: 1 collision = 1 heap operation. No eager assumption (and no price to pay in terms of memory) that "more will come". Linear search within the collision lists. Cheaper delete (not counting free() here). One major motivation to think of arrays instead of lists would be to reduce the number of heap operations. Amusingly, the general assumption seems to be that they are cheap, but not many actually know how much time an allocation requires compared to, say, traversing the list looking for a match.
Strategy 3
Use neither arrays nor lists, but store the overflow entries within the hash table itself at another location (i.e. open addressing). Last time I mentioned that here, I got frowned upon a bit. Benefit: 0 memory allocations. This probably works best if the fill grade of the table is indeed low and there are only a few collisions.
Summary
There are indeed many options and trade-offs to choose from. Generic hash table implementations such as those in standard libraries cannot make any assumptions regarding the write/read ratio, quality of the hash key, use cases, etc. If, on the other hand, all those traits of a hash table application are known (and if it is worth the effort), it is quite possible to create an optimized hash table implementation tailored to the set of trade-offs the application requires.
The reason is that the expected length of these lists is tiny, with only zero, one, or two entries in the vast majority of cases. Yet these lists may also become arbitrarily long in the worst case of a really bad hash function. And even though this worst case is not the case that hash tables are optimized for, they still need to be able to handle it gracefully.
Now, for an array-based approach, you would need to set a minimal array size. And if that initial array size is anything other than zero, you already have significant space overhead due to all the empty buckets. A minimal array size of two would mean that you waste half your space. And you would need to implement logic to reallocate the arrays when they become full, because you cannot put an upper limit on the list length: you need to be able to handle the worst case.
The list-based approach is much more efficient under these constraints: it has only the allocation overhead for the node objects, most accesses have the same amount of indirection as the array-based approach, and it's easier to write.
I'm not saying that it's impossible to write an array-based implementation, but it's significantly more complex and less efficient than the list-based approach.
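For reference, the list-based chaining these answers describe can be sketched in a few lines. This is a minimal, non-production illustration (names are hypothetical; real implementations also track size, resize, and store key/value entries):

import java.util.LinkedList;

// Minimal sketch of separate chaining: each bucket is a list of the keys
// that hashed to that index.
class ChainedHashSet<K> {
    private final LinkedList<K>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedHashSet(int capacity) {
        buckets = new LinkedList[capacity];
    }

    void add(K key) {
        int i = Math.floorMod(key.hashCode(), buckets.length);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();   // one list per non-empty bucket
        if (!buckets[i].contains(key)) buckets[i].add(key);        // one node per colliding key
    }

    boolean contains(K key) {
        int i = Math.floorMod(key.hashCode(), buckets.length);
        return buckets[i] != null && buckets[i].contains(key);
    }
}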
why many languages (Java, C++, Python, Perl etc) implement hash tables using linked lists to avoid collisions instead of arrays?
I'm almost sure, at least for most of those "many" languages:
The original implementors of hash tables for these languages just followed the classic algorithm description from Knuth or another algorithms book, and didn't even consider such subtle implementation choices.
Some observations:
Even using collision resolution with separate chains instead of, say, open addressing is a seriously doubtful choice for "the most generic hash table implementation". My personal conviction: it is not the right choice.
When the hash table's load factor is pretty low (as it should be chosen in nearly 99% of hash table usages), the difference between the suggested approaches can hardly affect overall data structure performance (as cmaster explained at the beginning of his answer, and delnan meaningfully refined in the comments). Since generic hash table implementations in languages are not designed for high density, "linked lists vs arrays" is not a pressing issue for them.
Returning to the topic question itself, I don't see any conceptual reason why linked lists should be better than arrays. I can easily imagine that, in fact, arrays are faster on modern hardware / consume less memory with modern memory allocators inside modern language runtimes / operating systems, especially when the hash table's key is a primitive or a copied structure. You can find some arguments backing this opinion here: http://en.wikipedia.org/wiki/Hash_table#Separate_chaining_with_other_structures
But the only way to find the correct answer (for a particular CPU, OS, memory allocator, virtual machine and its garbage collection algorithm, and the hash table use case / workload!) is to implement both approaches and compare them.
Am I misunderstanding something?
No, you don't misunderstand anything; your question is legitimate. It's an example of fair confusion, when something is done in some specific way not for a strong reason but, largely, by happenstance.
If it is implemented using arrays, insertion will be costly due to reallocation, which doesn't happen in the case of a linked list.
Coming to deletion, we have to search the complete array and then either mark the slot as deleted or move the remaining elements (in the former case, insertion becomes even more difficult, as we have to search for empty slots).
To improve the worst-case time complexity from O(n) to O(log n), once the number of items in a hash bucket grows beyond a certain threshold (8, in Java 8+'s HashMap), that bucket switches from a linked list of entries to a balanced tree (in Java).

Optimizing Inserting into the Middle of a List

I have algorithms that work with dynamically growing lists (contiguous memory like a C++ vector, Java ArrayList or C# List). Until recently, these algorithms would insert new values into the middle of the lists. Of course, this was usually a very slow operation: every time an item was added, all the items after it needed to be shifted to a higher index. Do this a few times for each algorithm and things get really slow.
My realization was that I could add the new items to the end of the list and then rotate them into position later. That's one option!
Another option, when I know how many items I'm adding ahead of time, is to add that many items to the back, shift the existing items, and then perform the algorithm in-place in the hole I've made for myself. The negative is that I have to add some default values to the end of the list and then just overwrite them.
I did a quick analysis of these options and concluded that the second option is more efficient. My reasoning was that the rotation with the first option would result in in-place swaps (requiring a temporary). My only concern with the second option is that I am creating a bunch of default values that just get thrown away. Most of the time, these default values will be null or a mem-filled value type.
However, I'd like someone else familiar with algorithms to tell me which approach would be faster. Or, perhaps there's an even more efficient solution I haven't considered.
Arrays aren't efficient for lots of insertions or deletions into anywhere other than the end of the array. Consider whether using a different data structure (such as one suggested in one of the other answers) may be more efficient. Without knowing the problem you're trying to solve, it's near-impossible to suggest a data structure (there's no one solution for all problems). That being said...
The second option is definitely the better option of the two. A somewhat better option (avoiding the default-value issue): say the list is 0123789 and 456 is to be inserted in the middle - simply copy the tail 789 to the end and then overwrite the original 789 with 456, so the only intermediate step would be 0123789789.
Your default-value concern is, however, (generally) not a big issue:
In Java, for one, you cannot (to my knowledge) even allocate memory for an array that's not 0- or null-filled. C++ STL containers also enforce this, I believe (but not C++ itself).
The size of a pointer compared to any moderate-sized class is minimal, so assigning it a default value also takes minimal time. (In Java and C# everything is a reference; in C++ you can use pointers, where something like boost::shared_ptr or a pointer-vector is preferred over raw pointers. None of this applies to primitives, which are small to start with, so they're generally not a big issue either.)
I'd also suggest forcing a reallocation to a specified size before you start inserting to the end of the array (Java's ArrayList::ensureCapacity or C++'s vector::reserve). In case you didn't know - varying-length-array implementations tend to have an internal array that's bigger than what size() returns or what's accessible (in order to prevent constant reallocation of memory as you insert or delete values).
Also note that there are more efficient methods to copy parts of an array than doing it manually with for loops (e.g. Java's System.arraycopy).
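As a concrete sketch of that last point, here is the insert-into-the-middle done with three bulk System.arraycopy calls instead of element-by-element loops (the values are arbitrary and purely illustrative):

import java.util.Arrays;

public class ArraycopyInsertDemo {
    public static void main(String[] args) {
        int[] data     = {0, 1, 2, 3, 7, 8, 9};
        int[] newItems = {4, 5, 6};
        int insertAt   = 4;

        // Build the result with three bulk copies rather than shifting in a loop.
        int[] result = new int[data.length + newItems.length];
        System.arraycopy(data, 0, result, 0, insertAt);                        // prefix: 0 1 2 3
        System.arraycopy(newItems, 0, result, insertAt, newItems.length);      // new:    4 5 6
        System.arraycopy(data, insertAt, result,
                         insertAt + newItems.length, data.length - insertAt);  // tail:   7 8 9

        System.out.println(Arrays.toString(result)); // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}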
You might want to consider changing your representation of the list from using a dynamic array to using some other structure. Here are two options that allow you to implement these operations efficiently:
An order statistic tree is a modified type of binary tree that supports insertions and selections anywhere in O(log n) time, as well as lookups in O(log n) time. This will increase your memory usage quite a bit because of the overhead for the pointers and extra bookkeeping, but should dramatically speed up insertions. However, it will slow down lookups a bit.
If you always know the insertion point in advance, you could consider switching to a linked list instead of an array, and just keep a pointer to the linked list cell where insertions will occur. However, this slows down random access to O(n), which could possibly be an issue in your setup.
Alternatively, if you always know where insertions will happen, you could consider representing your array as two stacks - one stack holding the contents of the array to the left of the insert point and one holding the (reverse of the) elements to the right of the insertion point. This makes insertions fast, and with the right type of stack implementation it can keep random access fast as well.
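A minimal sketch of that two-stack idea (the class and method names here are hypothetical): the left stack holds everything before the insertion point, the right stack holds everything after it, so an insert at the split point is a single push.

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: a sequence represented as two stacks around an insertion cursor.
final class TwoStackList<T> {
    private final Deque<T> left  = new ArrayDeque<>(); // elements before the cursor (top = nearest)
    private final Deque<T> right = new ArrayDeque<>(); // elements after the cursor (top = nearest)

    void insert(T value)   { left.push(value); }        // O(1) insert at the cursor
    void moveCursorRight() { left.push(right.pop()); }  // O(1) cursor movement
    void moveCursorLeft()  { right.push(left.pop()); }

    int size() { return left.size() + right.size(); }
}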
Hope this helps!
HashMaps and Linked Lists were designed for the problem you are having. Given an indexed data structure with numbered items, the difficulty with inserting items in the middle is that it requires renumbering every item after the insertion point.
You need a data structure which is optimized to make inserts a constant O(1) complexity. HashMaps were designed to make insert and delete operations lightning quick regardless of dataset size.
I can't pretend to do the HashMap subject justice by describing it. Here is a good intro: http://en.wikipedia.org/wiki/Hash_table

Optimized "Multidimensional" Arrays in Ruby

From birth I've always been taught to avoid nested arrays like the plague for performance and internal data structure reasons. So I'm trying to find a good solution for optimized multidimensional data structures in Ruby.
The typical solution would involve maybe using a 1D array and accessing each element by x*width + y.
Ruby has the ability to overload the [] operator, so perhaps a good solution would involve using multi_dimensional_array[2,4], or even using a splat to support an arbitrary number of dimensions. (But really, I only need two dimensions.)
Is there a library/gem already out there for this? If not, what would the best way to go about writing this be?
My nested-array-lookups are the bottleneck right now of my rather computationally-intensive script, so this is something that is important and not a case of premature optimization.
If it helps, my script uses mostly random lookups and less traversals.
narray
NArray is a Numerical N-dimensional Array class. Supported element types are 1/2/4-byte Integer, single/double-precision Real/Complex, and Ruby Object. This extension library incorporates fast calculation and easy manipulation of large numerical arrays into the Ruby language. NArray has features similar to NumPy, but NArray has vector and matrix subclasses.
You could inherit from Array and create your own class that emulated a multi-dimensional array (but was internally a simple 1-dimensional array). You may see some speedup from it, but it's hard to say without writing the code both ways and profiling it.
You may also want to experiment with the NArray class.
All that aside, your nested array lookups might not be the real bottleneck that they appear to be. On several occasions, I have had the same problem and then later found out that re-writing some of my logic cleared up the bottleneck. It's more than just speeding up the nested lookups, it's about minimizing the number of lookups needed. Each "random access" in an n-dimensional array takes n lookups (one per nested array level). You can reduce this by iterating through the dimensions using code like:
array.each { |x|
  x.each { |y|
    y.each { |z|
      # ...
    }
  }
}
This allows you to do a single lookup in the first dimension and then access everything "behind" it, then a single lookup in the second dimension, etc etc. This will result in significantly fewer lookups than randomly accessing elements.
If you need random element access, you may want to try using a hash instead. You can take the array indices, concatenate them together as a string, and use that as the hash key (for example, array[12][0][3] becomes hash['0012_0000_0003']). This may result in faster "random access" times, but you'd want to profile it to know for certain.
Any chance you can post some of your problematic code? Knowing the problem code will make it easier for us to recommend a solution.
Nested arrays aren't that bad if you traverse them properly, meaning you traverse rows first and then columns. This should be quite fast. If you need a certain element often, you should store its value in a variable; otherwise you're jumping around in memory, and this leads to bad performance.
Big Rule: Don't jump around in your nested array; try to traverse it linearly from row to row.
