Weak-memoizing the result of a multi-parameter function in OCaml

I am looking for a way to memoize the results of an OCaml function f that takes two parameters (or more, in general). In addition (and this is the difficult part), I want the map underlying this process to forget a result altogether if either of the values for the two parameters is garbage collected.
For a function that takes exactly one argument, this can be done with the Weak module and its Make functor in a straightforward way. To generalise this to something that can memoize functions of higher arity, a naive solution is to create a weak map from tuples of values to result values. But this will not work correctly with respect to garbage collection, as the tuple of values only exists within the scope of the memoization function, not the client code that calls f. In fact, the weak reference will be to the tuple, which is going to be garbage collected right after memoization (in the worst case).
Is there a way to do this without re-implementing Weak.Make?
Hash-consing is orthogonal to my requirements and is, in fact, not really desirable for my values.
Thanks!

Instead of indexing by tuples you could have a tree structure. You'd have one weak table indexed by the first function parameter whose entries are secondary weak tables. The secondary tables would be indexed by the second function parameter and contain the memoized results. This structure will forget the memoized function results as soon as either function parameter is GCed. However, the secondary tables themselves will be retained as long as the first function parameter is live. Depending on the sizes of your function results and the distribution of different first parameters, this could be a reasonable tradeoff.
I haven't tested this, either. Also it seems reasonably obvious.
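For what it's worth, here is a minimal, untested sketch of that two-level structure using the standard library's Ephemeron tables (Ephemeron.K1.Make needs OCaml 4.03+, find_opt 4.05+); the module and function names are only placeholders:

    (* Outer table: weakly keyed by the first argument, holding the inner
       table as ephemeron data, so it lives only as long as that argument.
       Inner table: weakly keyed by the second argument, holding the result. *)
    module Memo2 (A : Hashtbl.HashedType) (B : Hashtbl.HashedType) = struct
      module Outer = Ephemeron.K1.Make (A)
      module Inner = Ephemeron.K1.Make (B)

      let memoize (f : A.t -> B.t -> 'r) : A.t -> B.t -> 'r =
        let outer : 'r Inner.t Outer.t = Outer.create 16 in
        fun a b ->
          let inner =
            match Outer.find_opt outer a with
            | Some t -> t
            | None -> let t = Inner.create 16 in Outer.replace outer a t; t
          in
          match Inner.find_opt inner b with
          | Some r -> r
          | None -> let r = f a b in Inner.replace inner b r; r
    end

On recent compilers you can also avoid the nesting altogether: Ephemeron.K2.Make gives you a single table keyed weakly on both arguments at once, which is essentially the behaviour the question asks for.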

One idea is to perform your own garbage collection.
For simplicity, let's assume that all arguments have the same type k.
In addition to the main weak table containing the memoized results keyed by k * k, create a secondary weak table containing single arguments of type k. The idea is to scan the main table once in a while and to remove the bindings that are no longer wanted. This is done by looking up the arguments in the secondary table; then if any of them is gone you remove the binding from the main table.
(Disclaimer: I haven't tested this; it may not work or there may be better solutions)
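To make the idea a bit more concrete, here is one untested way the bookkeeping could look in OCaml. It deviates slightly from the description above: instead of a shared secondary table, each binding carries its own 2-slot Weak.t for its two arguments, and sweep plays the role of the periodic scan that removes bindings whose arguments are gone. All names are illustrative:

    module ManualMemo (K : Hashtbl.HashedType) = struct
      (* Each binding keeps its arguments only through weak pointers,
         so the memo table itself never keeps them alive. *)
      type 'r cell = { args : K.t Weak.t; result : 'r }

      (* Main table: combined hash of the two arguments -> cells. *)
      type 'r t = (int, 'r cell) Hashtbl.t

      let create () : 'r t = Hashtbl.create 16

      let key a b = Hashtbl.hash (K.hash a, K.hash b)

      let find_opt (t : 'r t) a b =
        let matches c =
          match Weak.get c.args 0, Weak.get c.args 1 with
          | Some a', Some b' -> K.equal a a' && K.equal b b'
          | _ -> false
        in
        match List.find_opt matches (Hashtbl.find_all t (key a b)) with
        | Some c -> Some c.result
        | None -> None

      let add (t : 'r t) a b result =
        let args = Weak.create 2 in
        Weak.set args 0 (Some a);
        Weak.set args 1 (Some b);
        Hashtbl.add t (key a b) { args; result }

      (* The "scan once in a while": remove every binding whose arguments
         were garbage collected, so its result can be reclaimed too. *)
      let sweep (t : 'r t) =
        Hashtbl.filter_map_inplace
          (fun _ c ->
            if Weak.check c.args 0 && Weak.check c.args 1 then Some c else None)
          t
    end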

I know this is an old question, but my colleagues have recently developed an incremental computation library, called Adapton, that can handle this functionality. You can find the code here. You probably want to use the LazySABidi functor (the others are for benchmarking). You can look in the Applications folder for examples of how to use the library. Let me know if you have any more questions.

Related

Does ND4J slicing make a copy of the original array?

ND4J INDArray slicing is achieved through one of the overloaded get() methods, as answered in java - Get an arbitrary slice of a Nd4j array - Stack Overflow. Since an INDArray occupies a contiguous block of native memory, does slicing with get() make a copy of the original memory (especially row slicing, where it is possible to create a new INDArray backed by the same memory)?
I have found another INDArray method, subArray(). Does this one make any difference?
I am asking because I am trying to create a DatasetIterator that can extract data directly from INDArrays, and I want to eliminate any possible overhead. There is too much abstraction in the source code and I couldn't find the implementation myself.
A similar question about NumPy is asked in python - Numpy: views vs copy by slicing - Stack Overflow, and the answer can be found in Indexing — NumPy v1.16 Manual:
The rule of thumb here can be: in the context of lvalue indexing (i.e. the indices are placed in the left hand side value of an assignment), no view or copy of the array is created (because there is no need to). However, with regular values, the above rules for creating views do apply.
The short answer is: no, it uses references (views) where possible. To make a copy, call .dup().
To quote https://deeplearning4j.org/docs/latest/nd4j-overview:
Views: When Two or More NDArrays Refer to the Same Data
A key concept in ND4J is the fact that two NDArrays can actually point to the same underlying data in memory. Usually, we have one NDArray referring to some subset of another array, and this only occurs for certain operations (such as INDArray.get(), INDArray.transpose(), INDArray.getRow() etc.). This is a powerful concept, and one that is worth understanding.
There are two primary motivations for this:
There are considerable performance benefits, most notably in avoiding copying arrays
We gain a lot of power in terms of how we can perform operations on our NDArrays
Consider a simple operation like a matrix transpose on a large (10,000 x 10,000) matrix. Using views, we can perform this matrix transpose in constant time without performing any copies (i.e., O(1) in big O notation), avoiding the considerable cost of copying all of the array elements. Of course, sometimes we do want to make a copy - at which point we can use INDArray.dup() to get a copy. For example, to get a copy of a transposed matrix, use INDArray out = myMatrix.transpose().dup(). After this dup() call, there will be no link between the original array myMatrix and the array out (thus, changes to one will not impact the other).
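This is specific to ND4J, but if an analogy helps: OCaml's Bigarray has exactly the same view-versus-copy distinction, with an explicit blit playing the role of dup(). A small illustration (not ND4J code):

    let () =
      let open Bigarray in
      let m = Array2.create float64 c_layout 2 3 in
      Array2.fill m 0.0;
      (* [row] is a view of row 0 of [m]: no data is copied *)
      let row = Array2.slice_left m 0 in
      Array1.set row 1 42.0;
      assert (Array2.get m 0 1 = 42.0);   (* the original sees the write *)
      (* an explicit copy, analogous to INDArray.dup() *)
      let copy = Array1.create float64 c_layout (Array1.dim row) in
      Array1.blit row copy;
      Array1.set copy 1 7.0;
      assert (Array2.get m 0 1 = 42.0)    (* the original is unaffected *)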

What are the "new hash functions" in cuckoo hashing?

I am reading about cuckoo hashing from the paper by Pagh and Rodler, and I can't understand the meaning of this paragraph:
It may happen that this process loops, as shown in Fig. 1(b). Therefore the number of iterations is bounded by a value “MaxLoop” to be specified in Section 2.3. If this number of iterations is reached, we rehash the keys in the tables using new hash functions, and try once again to accommodate the nestless key. There is no need to allocate new tables for the rehashing: We may simply run through the tables to delete and perform the usual insertion procedure on all keys found not to be at their intended position in the table.
What does it mean to use new hash functions?
In the insert algorithm the table is resized. Are we supposed to have a "pool" of hash functions to use somehow? How do we create this pool?
Yes, they're expecting new hash functions, just like they say. Fortunately, they don't require a pile of new algorithms, just slightly different hashing behavior on your current data set.
Take a look at section 2.1 of the paper, and then Appendix A. It discusses the construction of random universal hash functions.
A simple, hopefully illustrative example is to suppose you've got some normal hash function you like that operates on blocks of bytes. We'll call it H(x). You want to use it to produce a family of new, slightly different hash functions H_n(x). Well, if H(x) is good, and your requirements are weak, you can just define H_n(x) = H(concat(n,x)). You don't have nice strong guarantees about the behaviors of H_n(x), but you'd expect most of them to be different.
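A tiny sketch of that construction, assuming string keys and using OCaml's built-in Hashtbl.hash as the base function H (a weak family, but typically good enough for rehashing after a MaxLoop failure):

    (* H_n(x) = H(concat(n, x)): prefixing a counter yields a family of
       slightly different hash functions from one base function. *)
    let h (x : string) : int = Hashtbl.hash x

    let h_n (n : int) (x : string) : int = h (string_of_int n ^ "/" ^ x)

    (* When an insertion loops, bump n and rehash every key with the new
       pair of functions, e.g. (h_n (2 * n)) and (h_n (2 * n + 1)). *)

The standard library's Hashtbl.seeded_hash does essentially the same thing, with an integer seed in place of the concatenated prefix.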

Basics in Universal Hashing, how to ensure accessibility

To my current understanding, universal hashing is a method whereby the hash function is chosen randomly at runtime in order to guarantee reasonable performance for any kind of input.
I understand we may do this to prevent manipulation by somebody deliberately choosing malicious input (a real possibility when a deterministic hash function is known).
My question is the following: is it not true that we still need to guarantee that a key is mapped to the same address every time we hash it? For instance, if we want to retrieve information but the hash function is chosen at random, how do we guarantee we can get back to our data?
A universal family of hash functions is a set of hash functions with the property that, for any two distinct elements of the universe, a randomly chosen function from the family sends them to the same slot with only low probability. Typically, the hash table implementation picks one random function from the family when it is created. Once this hash function is chosen, the table works as usual: you use that function to compute a hash code for an object, then put the object into the appropriate location. The table has to remember which hash function it picked and use it consistently throughout the program, since otherwise (as you've noted) it would forget where it mapped each element.
Hope this helps!
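To make the "pick once, use consistently" point concrete, here is a toy implementation of the classic universal family h_{a,b}(x) = ((a*x + b) mod p) mod m for non-negative integer keys below p; a and b are drawn once at table creation and then reused for every insert and lookup (all names are illustrative):

    let p = 1_073_741_789   (* a prime just below 2^30, so Random.int can draw from [0, p) *)

    type 'v table = {
      a : int;                              (* random coefficient in [1, p-1] *)
      b : int;                              (* random coefficient in [0, p-1] *)
      m : int;                              (* number of buckets *)
      buckets : (int * 'v) list array;      (* chained buckets *)
    }

    let create m =
      Random.self_init ();
      { a = 1 + Random.int (p - 1); b = Random.int p; m; buckets = Array.make m [] }

    (* Assumes 0 <= x < p and a 64-bit OCaml, so a * x does not overflow.
       The same randomly chosen (a, b) is used for every operation. *)
    let hash t x = ((t.a * x + t.b) mod p) mod t.m

    let insert t key value =
      let i = hash t key in
      t.buckets.(i) <- (key, value) :: t.buckets.(i)

    let find t key =
      (* Looks in the same bucket insert used, because the function is fixed. *)
      List.assoc_opt key t.buckets.(hash t key)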

What's a simple way to design a memoization system with limited memory?

I am writing a manual computation memoization system (ugh, in Matlab). The straightforward parts are easy:
A way to put data into the memoization system after performing a computation.
A way to query and get data out of the memoization.
A way to query the system for all the 'keys'.
These parts are not so much in doubt. The problem is that my computer has a finite amount of memory, so sometimes the 'put' operation will have to dump some objects out of memory. I am worried about 'cache misses', so I would like some relatively simple system for dropping the memoized objects which are not used frequently and/or are not costly to recompute. How do I design that system? The parts I can imagine it having:
A way to tell the 'put' operation how costly (relatively speaking) the computation was.
A way to optionally hint to the 'put' operation how often the computation might be needed (going forward).
The 'get' operation would want to note how often a given object is queried.
A way to tell the whole system the maximum amount of memory to use (ok, that's simple).
The real guts of it would be in the 'put' operation when you hit the memory limit and it has to cull some objects, based on their memory footprint, their costliness, and their usefulness. How do I do that?
Sorry if this is too vague, or off-topic.
I'd do it by creating a subclass of DYNAMICPROPS that uses a cell array to store the data internally. This way, you can dynamically add more data to the object.
Here's the basic design idea:
The data is stored in a cell array. Each property gets its own row, with the first column being the property name (for convenience), the second column a function handle to calculate the data, the third column the data itself, the fourth column the time it took to generate the data, the fifth column an array of, say, length 100 storing the timestamps of the last 100 times the property was accessed, and the sixth column the size of the variable.
There is a generic get method that takes as input the row number corresponding to the property (see below). The get method first checks whether column 3 is empty. If it is not, it returns the value and records the access timestamp. If it is, it performs the computation using the handle from column 2 inside a TIC/TOC pair to measure how expensive the computation is (the cost goes into column 4, the timestamp into column 5). Then it checks whether there is enough space to store the result. If yes, it stores the data; otherwise it weighs the size, together with the product of how often the data were accessed and how long they would take to regenerate, to decide what to cull.
In addition, there is an 'add' method that appends a row to the cell array, creates a dynamic property (using addprop) with the same name as the function handle, and sets its get method to myGetMethod(myPropertyIndex). If you need to pass parameters to the function, you can create an additional property myDynamicPropertyName_parameters with a set method that removes previously calculated data whenever the parameters change value.
Finally, you can add a few dependent properties that report how many properties there are (the number of rows in the cell array), what they are called (the first column of the cell array), etc.
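The culling rule in the second-to-last paragraph is the interesting part, so here is a language-neutral sketch of it (in OCaml rather than MATLAB; the record fields and the scoring formula are illustrative choices, not a fixed design): each entry records its size, its compute time and an access count, and when the cache is over budget the entries with the lowest "hits x recompute-cost per byte" score are dropped first.

    type 'a entry = {
      value : 'a;
      bytes : int;            (* memory footprint *)
      compute_time : float;   (* seconds spent producing it *)
      mutable hits : int;     (* how often it has been requested *)
    }

    let note_hit e = e.hits <- e.hits + 1   (* called by the 'get' operation *)

    (* Cheap-to-recompute, rarely used, large entries score lowest. *)
    let score e =
      float_of_int (e.hits + 1) *. e.compute_time /. float_of_int (max 1 e.bytes)

    let cull budget (cache : (string, 'a entry) Hashtbl.t) =
      let total = Hashtbl.fold (fun _ e acc -> acc + e.bytes) cache 0 in
      if total > budget then begin
        let by_score =
          Hashtbl.fold (fun k e acc -> (k, e) :: acc) cache []
          |> List.sort (fun (_, x) (_, y) -> compare (score x) (score y))
        in
        (* Drop the lowest-scoring entries until we are back under budget. *)
        ignore
          (List.fold_left
             (fun remaining (k, e) ->
               if remaining > budget then begin
                 Hashtbl.remove cache k;
                 remaining - e.bytes
               end
               else remaining)
             total by_score)
      end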
Consider using Java, since MATLAB runs on top of it, and can access it. This would work if you have marshallable values (ints, doubles, strings, matrices, but not structs or cell arrays).
LRU containers are available for Java:
Easy, simple to use LRU cache in java
http://java-planet.blogspot.com/2005/08/how-to-set-up-simple-lru-cache-using.html
http://www.codeproject.com/KB/java/lru.aspx
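If you end up rolling your own instead, an LRU cache is just a map plus a recency list. A minimal (and deliberately naive: the recency list is O(n) per access) sketch in OCaml, only to show the mechanism the linked Java containers implement with a doubly linked list:

    type ('k, 'v) lru = {
      capacity : int;
      table : ('k, 'v) Hashtbl.t;
      mutable order : 'k list;   (* keys, most recently used first *)
    }

    let create capacity = { capacity; table = Hashtbl.create capacity; order = [] }

    (* Move [key] to the front of the recency list. *)
    let touch cache key =
      cache.order <- key :: List.filter (fun k -> k <> key) cache.order

    let get cache key =
      match Hashtbl.find_opt cache.table key with
      | Some v -> touch cache key; Some v
      | None -> None

    let put cache key value =
      if not (Hashtbl.mem cache.table key)
         && Hashtbl.length cache.table >= cache.capacity
      then begin
        (* Evict the least recently used key. *)
        match List.rev cache.order with
        | oldest :: _ ->
            Hashtbl.remove cache.table oldest;
            cache.order <- List.filter (fun k -> k <> oldest) cache.order
        | [] -> ()
      end;
      Hashtbl.replace cache.table key value;
      touch cache key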

Is there any performance difference between myCollection.Where(...).FirstOrDefault() and myCollection.FirstOrDefault(...)

Is there any performance difference between myCollection.Where(...).FirstOrDefault() and myCollection.FirstOrDefault(...), filling in the dots with the predicate you are using?
Assuming we're talking LinqToObjects (obviously LinqToSql, LinqToWhatever have their own rules), the first will be ever so slightly slower since a new iterator must be created, but it's incredibly unlikely that you would ever notice the difference. In terms of number of comparisons and number of items examined, the time the two take to run will be virtually identical.
In case you're worried: what will not happen is that the .Where operator filters the list down to n items and then .FirstOrDefault takes the first item out of that filtered list. Both forms will short-circuit correctly.
If we assume that in both cases you're using the Extension methods provided by the Enumerable static class, then you'll be hard pressed to measure any difference between the two.
The longer form ...
myCollection.Where(...).FirstOrDefault()
... will (technically) produce more memory activity (creating an intermediary iterator to handle the Where() clause) and involve a few more cycles of processing.
The thing is that these iterators are lazy: the Where() clause won't go merrily through the entire list evaluating the predicate; it will only check as many items as necessary to find one to pass through.
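The short-circuiting claim is easy to see with any lazy sequence type. The same experiment in OCaml (not C#; needs OCaml 4.14+ for Seq.ints, Seq.uncons and Seq.find), counting how many times the predicate runs in each form:

    let () =
      let probes = ref 0 in
      let p x = incr probes; x mod 2 = 0 in
      let s = Seq.ints 1 in                                     (* 1, 2, 3, ... *)
      (* analogue of myCollection.Where(p).FirstOrDefault() *)
      let via_filter = Seq.filter p s |> Seq.uncons |> Option.map fst in
      let filter_probes = !probes in
      probes := 0;
      (* analogue of myCollection.FirstOrDefault(p) *)
      let direct = Seq.find p s in
      assert (via_filter = direct);
      assert (filter_probes = !probes)   (* same number of predicate calls: 2 *)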
