I am currently trying to come up with a semi-decent (considering complexity, statistical properties and common sense) algorithm for sampling.
The data is currently contained inside a hash table, where each key is an item and the key's value is the item's frequency in the original distribution.
If one wanted to sample from such a histogram, how would one go about it so that the original probabilities of the items are preserved in the sample?
Also, we require a flag indicating whether duplicate items are allowed in the sample. When duplicates are not allowed, the best I have come up with is to apply the algorithm from the paragraph above and delete each item from the hash table once it has been sampled. This way, at least the relative probabilities are preserved amongst the remaining items. However, I am unsure whether this is an accepted practice statistically.
Is there a generally accepted algorithm for doing this? If it helps, we need to implement it in Common Lisp.
This is part of an answer. It uses a list of (probability item) pairs instead of a hash table:
(defun random-item-with-prob (prob-item-pairs)
  "The argument PROB-ITEM-PAIRS is ((p_1 item_1) (p_2 item_2) ...
(p_n item_n)). The function returns one of the items according to the
probabilities."
  (loop with p = (random 1.0)
        with x = 0
        for pair in prob-item-pairs
        do (if (< p (+ (first pair) x))
               (return (second pair))
               (incf x (first pair)))))
For the second part of your question: If you want to sample according to frequencies, this means that you care about the distribution of the data. Removing items (or not allowing duplicates) alters the distribution during the sampling procedure. If you really want to do that, you can repeat calls to the previous function, removing duplicates until you have the desired sample size.
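To make the procedure concrete, here is a minimal Python sketch of both modes, assuming the histogram is a plain dict mapping item to frequency (the function name and signature are invented for illustration). The no-duplicates mode deletes each sampled item, exactly as described in the question:

import random

def sample_from_histogram(freqs, k, allow_duplicates=True):
    """freqs maps item -> frequency (any positive weights); returns k sampled items."""
    freqs = dict(freqs)                       # work on a copy so we can delete from it
    sample = []
    for _ in range(k):
        total = sum(freqs.values())
        r = random.random() * total           # a point in [0, total)
        acc = 0.0
        for item, f in freqs.items():         # linear scan, like the Lisp answer above
            acc += f
            if r < acc:
                sample.append(item)
                if not allow_duplicates:
                    del freqs[item]           # keeps the relative probabilities of the rest
                break
    return sample

Note that there is no need to normalise the frequencies; drawing a point in [0, total) against the raw counts gives the same distribution.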
I have an operation A * A -> A, which is commutative and associative. This means the order I apply it in doesn't matter, as long as I use the same elements. Nice.
I have to apply it to a list of values. To be more precise, I have to use it as the operation to accumulate the values of the list. So far, so good.
I then have a series of requests to add an element to the list, or erase it from the list. After each insertion or deletion, I have to return the new accumulated value for the new list. Simple, right?
The problem is that I don't have an inverse; that is, there is no operation '/' which, given a * b and b, can tell me that the other operand must have been a. (In fact, there isn't even an identity element.)
So my only obvious option is to re-accumulate the whole list at every deletion, in linear time.
Can I do better? I've thought a lot about it.
And the answer is, of course I can... if I really want: I need to implement a custom binary tree, maybe a red/black one to have good worst case guarantees. Have next to the value an additional cache storing the result of the whole subtree.
cache = value * left.cache * right.cache
Maintain this invariant after every operation; then the root cache is the result.
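Here is a minimal Python sketch of that caching idea, under the extra assumption that elements can be kept in fixed slots of an array-backed complete binary tree (an absent slot is None, which in effect adjoins an identity element; all names are made up):

class AggregateTree:
    def __init__(self, capacity, op):
        self.op = op
        self.size = 1
        while self.size < capacity:
            self.size *= 2
        self.tree = [None] * (2 * self.size)    # tree[1] is the root cache

    def _combine(self, a, b):
        if a is None:
            return b
        if b is None:
            return a
        return self.op(a, b)

    def set(self, slot, value):                 # value=None erases the slot
        i = self.size + slot
        self.tree[i] = value
        i //= 2
        while i >= 1:                           # repair the caches on the path to the root
            self.tree[i] = self._combine(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def total(self):                            # accumulated value of all present slots
        return self.tree[1]

Each update touches only the O(log n) caches between the changed slot and the root, which is the invariant above without having to write a balanced tree by hand (at the cost of fixing a capacity and mapping elements to slots).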
However, "implement a custom R/B tree while maintaining an additional invariant" isn't something I'm particularly comfortable at doing. Well I would do it, but not swear by its correctness. Plus, the constant before the log would probably be significant. It seems pretty unwieldy, to do a simple thing like keeping track of an accumulation.
Does anyone see a better solution?
For completeness: the operation is a union of filters. A filter is a pair (code, mask), and a value "passes the filter" if (C bitwise operators) (value ^ code) & mask == 0; that is, if its bits corresponding to the bits set in mask are equal to the corresponding bits in code. The union therefore sets to 0 (ignored) the bits where the masks or codes differ, and keeps the ones which are the same.
Bonus appreciation to anyone finding a way to exploit the specific properties of the operation to get a solution more efficient than is possible for the general problem I abstracted! ;-)
For your specific problem you could keep track, for each bit x, of:
The total number of times that bit x is set to 1 in a mask
The total number of times that bit x is set to 1 in a mask and bit x of code is equal to 0
The total number of times that bit x is set to 1 in a mask and bit x of code is equal to 1
With these 3 counts (for each bit) it is straightforward to compute the union of all the filters.
The complexity is O(R) (where R is the number of bits in mask) to add or remove a filter.
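A hedged Python sketch of this bookkeeping (the class and method names are invented; it keeps only two of the three counts, since the "code bit = 0" count is just their difference):

R = 32   # number of bits in a mask

class FilterUnion:
    def __init__(self):
        self.n = 0                   # number of filters currently in the set
        self.mask_cnt = [0] * R      # filters with mask bit x set
        self.one_cnt = [0] * R       # filters with mask bit x set and code bit x = 1

    def _update(self, code, mask, delta):
        self.n += delta
        for x in range(R):
            if (mask >> x) & 1:
                self.mask_cnt[x] += delta
                if (code >> x) & 1:
                    self.one_cnt[x] += delta

    def add(self, code, mask):
        self._update(code, mask, +1)

    def remove(self, code, mask):
        self._update(code, mask, -1)

    def union(self):
        """(code, mask) of the union of all current filters (assumes n >= 1)."""
        code = mask = 0
        for x in range(R):
            # bit x stays in the mask only if every filter constrains it
            # and all of them agree on its code bit
            if self.mask_cnt[x] == self.n and self.one_cnt[x] in (0, self.n):
                mask |= 1 << x
                if self.one_cnt[x] == self.n:
                    code |= 1 << x
        return code, mask

Adding or removing a filter is O(R), and recomputing the union is also O(R), independent of how many filters are currently in the set.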
Assume that we are working with a language which stores arrays in column-major order. Assume also that we have a function which uses 2-D array as an argument, and returns it.
I'm wondering whether one can claim that it is (or isn't) in general beneficial to transpose this array when calling the function, in order to work with column-wise operations instead of row-wise operations, or whether the transposing negates the benefits of column-wise operations.
As an example, in R I have an object of class ts named y which has dimension n x p, i.e. I have p time series of length n.
I need to make some computations with y in Fortran, where I have two loops with the following kind of structure:
do i = 1, n
   do j = 1, p
      ! just an example, some row-wise operations on `y`
      x(i,j) = a*y(i,j)
      D = ddot(m, y(i,1:p), 1, b, 1)
      ! ...
   end do
end do
As Fortran (like R) uses column-major storage, it would be better to do the computations with a p x n array instead. So instead of
out<-.Fortran("something",y=array(y,dim(y)),x=array(0,dim(y)))
ynew<-out$out$y
x<-out$out$x
I could use
out<-.Fortran("something2",y=t(array(y,dim(y))),x=array(0,dim(y)[2:1]))
ynew<-t(out$out$y)
x<-t(out$out$x)
where Fortran subroutine something2 would be something like
do i = 1, n
   do j = 1, p
      ! just an example, some column-wise operations on `y`
      x(j,i) = a*y(j,i)
      D = ddot(m, y(1:p,i), 1, b, 1)
      ! ...
   end do
end do
Does the choice of approach always depend on the dimensions n and p or is it possible to say one approach is better in terms of computation speed and/or memory requirements? In my application n is usually much larger than p, which is 1 to 10 in most cases.
More of a comment, but I wanted to put in a bit of code: under old-school F77 you would essentially be forced to use the second approach, as
y(1:p,i)
is simply a pointer to y(1,i), with the following p values contiguous in memory.
The first construct
y(i,1:p)
is a list of values strided through memory, so it seems to require making a copy of the data to pass to the subroutine. I say "seems" because I haven't the foggiest idea how a modern optimizing compiler deals with these things. I tend to think at best it's a wash; at worst this could really hurt. Imagine an array so large that you need to swap pages to access the whole vector.
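As a language-neutral illustration (this is Python/NumPy rather than Fortran, but np.asfortranarray gives the same column-major storage), the difference shows up directly in the strides:

import numpy as np

y = np.asfortranarray(np.arange(12.0).reshape(3, 4))    # 3 x 4, column-major storage

col = y[:, 1]     # one column: elements adjacent in memory
row = y[1, :]     # one row: elements 3 doubles (24 bytes) apart

print(col.strides, col.flags['C_CONTIGUOUS'])   # (8,)  True  -> can be handed over as-is
print(row.strides, row.flags['C_CONTIGUOUS'])   # (24,) False -> a contiguous copy is needed

A routine that insists on contiguous input (as an F77-style dummy array does) forces a temporary copy in the second case.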
In the end the only way to answer this is to test it yourself
----------edit
Did a little testing and confirmed my hunch: passing rows y(i,1:p) does cost you vs. passing columns y(1:p,i). I used a subroutine that does practically nothing to see the difference. My guess is that with any real subroutine the hit is negligible.
Btw (and maybe this helps understand what goes on) passing every other value in a column
y(1:p:2,i) takes longer (orders of magnitude) than passing the whole column, while passing every other value in a row cuts the time in half vs. passing a whole row.
(using gfortran 12..)
I'm trying to figure out how non-destructive manipulation of large collections is implemented in functional programming, i.e. how it is possible to alter or remove single elements without having to create a completely new collection where all elements, even the unmodified ones, will be duplicated in memory. (Even if the original collection were garbage-collected, I'd expect the memory footprint and general performance of such a collection to be awful.)
This is how far I've got until now:
Using F#, I came up with a function insert that splits a list into two pieces and introduces a new element in-between, seemingly without cloning all unchanged elements:
// return a list without its first n elements:
// (helper function)
let rec skip list n =
    if n = 0 then
        list
    else
        match list with
        | [] -> []
        | x::xs -> skip xs (n-1)

// return only the first n elements of a list:
// (helper function)
let rec take list n =
    if n = 0 then
        []
    else
        match list with
        | [] -> []
        | x::xs -> x::(take xs (n-1))

// insert a value into a list at the specified zero-based position:
let insert list position value =
    (take list position) @ [value] @ (skip list position)
I then checked whether objects from an original list are "recycled" in new lists by using .NET's Object.ReferenceEquals:
open System
let (===) x y =
    Object.ReferenceEquals(x, y)
let x = Some(42)
let L = [Some(0); x; Some(43)]
let M = Some(1) |> insert L 1
The following three expressions all evaluate to true, indicating that the value referred to by x is re-used both in lists L and M, ie. that there is only 1 copy of this value in memory:
L.[1] === x
M.[2] === x
L.[1] === M.[2]
My question:
Do functional programming languages generally re-use values instead of cloning them to a new memory location, or was I just lucky with F#'s behaviour? Assuming the former, is this how reasonably memory-efficient editing of collections can be implemented in functional programming?
(Btw.: I know about Chris Okasaki's book Purely functional data structures, but haven't yet had the time to read it thoroughly.)
I'm trying to figure out how non-destructive manipulation of large collections is implemented in functional programming, i.e. how it is possible to alter or remove single elements without having to create a completely new collection where all elements, even the unmodified ones, will be duplicated in memory.
This page has a few descriptions and implementations of data structures in F#. Most of them come from Okasaki's Purely Functional Data Structures, although the AVL tree is my own implementation since it wasn't present in the book.
Now, since you asked about reusing unmodified nodes, let's take a simple binary tree:
type 'a tree =
    | Node of 'a tree * 'a * 'a tree
    | Nil

let rec insert v = function
    | Node(l, x, r) as node ->
        if v < x then Node(insert v l, x, r)       // reuses x and r
        elif v > x then Node(l, x, insert v r)     // reuses x and l
        else node
    | Nil -> Node(Nil, v, Nil)
Note that we re-use some of our nodes. Let's say we start with this tree:
When we insert an e into the tree, we get a brand new tree, with some of the nodes pointing back at our original tree:
If we don't have a reference to the xs tree above, then .NET will garbage collect any nodes without live references, specifically the d, g and f nodes.
Notice that we've only modified nodes along the path of our inserted node. This is pretty typical in most immutable data structures, including lists. So, the number of nodes we create is exactly equal to the number of nodes we need to traverse in order to insert into our data structure.
Do functional programming languages generally re-use values instead of cloning them to a new memory location, or was I just lucky with F#'s behaviour? Assuming the former, is this how reasonably memory-efficient editing of collections can be implemented in functional programming?
Yes.
Lists, however, aren't a very good data structure, since most non-trivial operations on them require O(n) time.
Balanced binary trees support O(log n) inserts, meaning we create O(log n) copies on every insert. Since log2(10^15) is ~= 50, the overhead is very very tiny for these particular data structures. Even if you keep around every copy of every object after inserts/deletes, your memory usage will increase at a rate of O(n log n) -- very reasonable, in my opinion.
How it is possible to alter or remove single elements without having to create a completely new collection where all elements, even the unmodified ones, will be duplicated in memory.
This works because no matter what kind of collection, the pointers to the elements are stored separately from the elements themselves. (Exception: some compilers will optimize some of the time, but they know what they are doing.) So for example, you can have two lists that differ only in the first element and share tails:
let shared = ["two", "three", "four"]
let l = "one" :: shared
let l' = "1a" :: shared
These two lists have the shared part in common and their first elements different. What's less obvious is that each list also begins with a unique pair, often called a "cons cell":
List l begins with a pair containing a pointer to "one" and a pointer to the shared tail.
List l' begins with a pair containing a pointer to "1a" and a pointer to the shared tail.
If we had only declared l and wanted to alter or remove the first element to get l', we'd do this:
let l' =
    match l with
    | _ :: rest -> "1a" :: rest
    | [] -> raise (Failure "cannot alter 1st elem of empty list")
There is constant cost:
Split l into its head and tail by examining the cons cell.
Allocate a new cons cell pointing to "1a" and the tail.
The new cons cell becomes the value of list l'.
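The same sharing can be sketched in Python with (head, tail) tuples standing in for cons cells (just an illustration of the layout, not real F# lists):

shared = ("two", ("three", ("four", None)))
l       = ("one", shared)       # examine l's cons cell ...
l_prime = ("1a", l[1])          # ... and allocate exactly one new cell for l'

assert l[1] is l_prime[1]       # both lists point at the very same tail in memory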
If you're making point-like changes in the middle of a big collection, typically you'll be using some sort of balanced tree which uses logarithmic time and space. Less frequently you may use a more sophisticated data structure:
Gerard Huet's zipper can be defined for just about any tree-like data structure and can be used to traverse and make point-like modifications at constant cost. Zippers are easy to understand (a small sketch follows this list).
Paterson and Hinze's finger trees offer very sophisticated representations of sequences, which among other tricks enable you to change elements in the middle efficiently—but they are hard to understand.
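Here is that minimal list-zipper sketch in Python; cons cells are modelled as (head, tail) tuples so that every step really is O(1) and shares structure (all function names are invented):

NIL = None

def cons(head, tail):
    return (head, tail)

def from_list(xs):
    """Build a zipper focused on the first element: (reversed prefix, suffix)."""
    lst = NIL
    for x in reversed(xs):
        lst = cons(x, lst)
    return (NIL, lst)

def right(zipper):
    """Move the focus one element to the right, O(1)."""
    before, after = zipper
    head, tail = after                # assumes the focus is not already past the end
    return (cons(head, before), tail)

def set_focus(zipper, value):
    """Replace the focused element, O(1); everything else is shared."""
    before, after = zipper
    _, tail = after
    return (before, cons(value, tail))

def to_list(zipper):
    """Rebuild a plain Python list, for inspection only."""
    before, after = zipper
    while before is not NIL:
        head, before = before
        after = cons(head, after)
    out = []
    while after is not NIL:
        head, after = after
        out.append(head)
    return out

For example, to_list(set_focus(right(from_list([1, 2, 3])), 9)) gives [1, 9, 3].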
While the referenced objects are the same in your code, I believe the storage space for the references themselves, and the structure of the list, is duplicated by take. As a result, while the referenced objects are the same, and the tails are shared between the two lists, the "cells" for the initial portions are duplicated.
I'm not an expert in functional programming, but maybe with some kind of tree you could achieve duplication of only log(n) elements, as you would have to recreate only the path from the root to the inserted element.
It sounds to me like your question is primarily about immutable data, not functional languages per se. Data is indeed necessarily immutable in purely functional code (cf. referential transparency), but I'm not aware of any non-toy languages that enforce absolute purity everywhere (though Haskell comes closest, if you like that sort of thing).
Roughly speaking, referential transparency means that no practical difference exists between a variable/expression and the value it holds/evaluates to. Because a piece of immutable data will (by definition) never change, it can be trivially identified with its value and should behave indistinguishably from any other data with the same value.
Therefore, by electing to draw no semantic distinction between two pieces of data with the same value, we have no reason to ever deliberately construct a duplicate value. So, in cases of obvious equality (e.g., adding something to a list, passing it as a function argument, &c.), languages where immutability guarantees are possible will generally reuse the existing reference, as you say.
Likewise, immutable data structures possess an intrinsic referential transparency of their structure (though not their contents). Assuming all contained values are also immutable, this means that pieces of the structure can safely be reused in new structures as well. For example, the tail of a cons list can often be reused; in your code, I would expect that:
(skip L 1) === (skip M 2)
...would also be true.
Reuse isn't always possible, though; the initial portion of a list removed by your skip function can't really be reused, for instance. For the same reason, appending something to the end of a cons list is an expensive operation, as it must reconstruct a whole new list, similar to the problem with concatenating null-terminated strings.
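To see why appending is expensive, here is a tiny sketch, again with (head, tail) tuples standing in for cons cells: appending must allocate a new cell for every element of the first list, while the second list is reused unchanged.

def append(xs, ys):
    # xs and ys are cons lists built from (head, tail) tuples; None is the empty list
    if xs is None:
        return ys                        # ys is shared as-is
    head, tail = xs
    return (head, append(tail, ys))      # one brand-new cell per element of xs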
In such cases, naive approaches quickly get into the realm of awful performance you were concerned about. Often, it's necessary to substantially rethink fundamental algorithms and data structures to adapt them successfully to immutable data. Techniques include breaking structures into layered or hierarchical pieces to isolate changes, inverting parts of the structure to expose cheap updates to frequently-modified parts, or even storing the original structure alongside a collection of updates and combining the updates with the original on the fly only when the data is accessed.
Since you're using F# here I'm going to assume you're at least somewhat familiar with C#; the inestimable Eric Lippert has a slew of posts on immutable data structures in C# that will probably enlighten you well beyond what I could provide. Over the course of several posts he demonstrates (reasonably efficient) immutable implementations of a stack, binary tree, and double-ended queue, among others. Delightful reading for any .NET programmer!
You may be interested in reduction strategies of expressions in functional programming languages. A good book on the subject is The Implementation of Functional Programming Languages, by Simon Peyton Jones, one of the creators of Haskell.
Have a look especially at the following chapter Graph Reduction of Lambda Expressions since it describes the sharing of common subexpressions.
Hope it helps, but I'm afraid it applies only to lazy languages.
Let's say you have two lists, L1 and L2, of the same length, N. We define prodSum as:
def prodSum(L1, L2):
    ans = 0
    for elem1, elem2 in zip(L1, L2):
        ans += elem1 * elem2
    return ans
Is there an efficient algorithm to find, assuming L1 is sorted, the number of permutations of L2 such that prodSum(L1, L2) < some pre-specified value?
If it would simplify the problem, you may assume that L1 and L2 are both lists of integers from [1, 2, ..., N].
Edit: Managu's answer has convinced me that this is impossible without assuming that L1 and L2 are lists of integers from [1, 2, ..., N]. I'd still be interested in solutions that assume this constraint.
I want to first dispel a certain amount of confusion about the math, then discuss two solutions and give code for one of them.
There is a counting class called #P which is a lot like the yes-no class NP. In a qualitative sense, it is even harder than NP. There is no particular reason to believe that this counting problem is any better than #P-hard, although it could be hard or easy to prove that.
However, many #P-hard problems and NP-hard problems vary tremendously in how long they take to solve in practice, and even one particular hard problem can be harder or easier depending on the properties of the input. What NP-hard or #P-hard mean is that there are hard cases. Some NP-hard and #P-hard problems also have less hard cases or even outright easy cases. (Others have very few cases that seem much easier than the hardest cases.)
So the practical question could depend a lot on the input of interest. Suppose that the threshold is on the high side or on the low side, or you have enough memory for a decent number of cached results. Then there is a useful recursive algorithm that makes use of two ideas, one of them already mentioned: (1) After partially assigning some of the values, the remaining threshold for list fragments may rule out all of the permutations, or it may allow all of them. (2) Memory permitting, you should cache the subtotals for some remaining threshold and some list fragments. To improve the caching, you might as well pick the elements from one of the lists in order.
Here is Python code that implements this algorithm:
list1 = [1,2,3,4,5,6,7,8,9,10,11]
list2 = [1,2,3,4,5,6,7,8,9,10,11]
size = len(list1)
threshold = 396      # This is smack in the middle, a hard value
cachecutoff = 6      # Cache results when up to this many are assigned

def dotproduct(v, w):
    return sum([a*b for a, b in zip(v, w)])

factorial = [1]
for n in range(1, len(list1)+1):
    factorial.append(factorial[-1]*n)

cache = {}

# Assumes two sorted lists of the same length
def countprods(list1, list2, threshold):
    if dotproduct(list1, list2) <= threshold:             # They all work
        return factorial[len(list1)]
    if dotproduct(list1, reversed(list2)) > threshold:    # None work
        return 0
    if (tuple(list2), threshold) in cache:                # Already been here
        return cache[(tuple(list2), threshold)]
    total = 0
    # Match the first element of list1 to each item in list2
    for n in range(len(list2)):
        total += countprods(list1[1:], list2[:n] + list2[n+1:],
                            threshold - list1[0]*list2[n])
    if len(list1) >= size - cachecutoff:
        cache[(tuple(list2), threshold)] = total
    return total

print('Total permutations below threshold:', countprods(list1, list2, threshold))
print('Cache size:', len(cache))
As the comment line says, I tested this code with a hard value of the threshold. It is quite a bit faster than a naive search over all permutations.
There is another algorithm that is better than this one if three conditions are met: (1) You don't have enough memory for a good cache, (2) the list entries are small non-negative integers, and (3) you're interested in the hardest thresholds. A second situation to use this second algorithm is if you want counts for all thresholds flat-out, whether or not the other conditions are met. To use this algorithm for two lists of length n, first pick a base x which is a power of 10 or 2 that is bigger than n factorial. Now make the matrix
M[i][j] = x**(list1[i]*list2[j])
If you compute the permanent of this matrix M using the Ryser formula, then the kth digit of the permanent in base x tells you the number of permutations for which the dot product is exactly k. Moreover, the Ryser formula is quite a bit faster than the summing over all permutations directly. (But it is still exponential, so it does not contradict the fact that computing the permanent is #P-hard.)
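A hedged Python sketch of this second idea (the function names are invented; it uses Ryser's inclusion-exclusion formula, which is still exponential and therefore only workable for small n, and it assumes the list entries are non-negative integers):

from itertools import combinations
from math import factorial as fact

def permanent(M):
    """Ryser's inclusion-exclusion formula, O(2^n * n^2)."""
    n = len(M)
    total = 0
    for k in range(1, n + 1):
        for cols in combinations(range(n), k):
            prod = 1
            for row in M:
                prod *= sum(row[j] for j in cols)
            total += (-1) ** (n - k) * prod
    return total

def counts_by_dot_product(list1, list2):
    """counts[k] = number of permutations of list2 whose dot product with list1 is exactly k."""
    n = len(list1)
    x = 10
    while x <= fact(n):                  # any base bigger than n! keeps the digits from carrying
        x *= 10
    M = [[x ** (a * b) for b in list2] for a in list1]
    p = permanent(M)
    counts = []
    while p:
        counts.append(p % x)             # the k-th base-x digit
        p //= x
    return counts

Summing counts[k] for all k below the threshold then gives the count the question asks for.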
Also, yes it is true that the set of permutations is the symmetric group. It would be great if you could use group theory in some way to accelerate this counting problem. But as far as I know, nothing all that deep comes from that description of the question.
Finally, if instead of exactly counting the number of permutations below a threshold, you only wanted to approximate that number, then probably the game changes completely. (You can approximate the permanent in polynomial time, but that doesn't help here.) I'd have to think about what to do; in any case it isn't the question posed.
I realized that there is another kind of caching/dynamic programming that is missing from the above discussion and the above code. The caching implemented in the code is early-stage caching: If just the first few values of list1 are assigned to list2, and if a remaining threshold occurs more than once, then the cache allows the code to reuse the result. This works great if the entries of list1 and list2 are integers that are not too large. But it will be a failed cache if the entries are typical floating point numbers.
However, you can also precompute at the other end, when most of the values of list1 have been assigned. In this case, you can make a sorted list of the subtotals for all of the remaining values. And remember, you can use up list1 in order, and do all of the permutations on the list2 side. For example, suppose that the last three entries of list1 are [4,5,6], and suppose that three of the values in list2 (somewhere in the middle) are [2.1,3.5,3.7]. Then you would cache a sorted list of the six dot products:
endcache[ [2.1, 3.5, 3.7] ] = [44.9, 45.1, 46.3, 46.7, 47.9, 48.1]
What does this do for you? If you look in the code that I did post, the function countprods(list1,list2,threshold) recursively does its work with a sub-threshold. The first argument, list1, might have been better as a global variable than as an argument. If list2 is short enough, countprods can do its work much faster by doing a binary search in the list endcache[list2]. (I just learned from stackoverflow that this is implemented in the bisect module in Python, although a performance code wouldn't be written in Python anyway.) Unlike the head cache, the end cache can speed up the code a lot even if there are no numerical coincidences among the entries of list1 and list2. Ryser's algorithm also stinks for this problem without numerical coincidences, so for this type of input I only see two accelerations: Sawing off a branch of the search tree using the "all" test and the "none" test, and the end cache.
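A small Python sketch of this end cache (again with invented names), which reproduces the six values in the example above up to floating-point rounding:

from bisect import bisect_right
from itertools import permutations

def build_endcache_entry(tail1, tail2):
    """tail1: the last k entries of list1; tail2: a k-subset of list2.
    Returns the sorted dot products over all k! ways of matching them."""
    return sorted(sum(a * b for a, b in zip(tail1, perm))
                  for perm in permutations(tail2))

def count_at_or_below(endcache_entry, remaining_threshold):
    """How many of those matchings stay within the remaining threshold (binary search)."""
    return bisect_right(endcache_entry, remaining_threshold)

# e.g. build_endcache_entry([4, 5, 6], (2.1, 3.5, 3.7))
#      -> [44.9, 45.1, 46.3, 46.7, 47.9, 48.1]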
Probably not (without the simplifying assumption): your problem is NP-hard. Here's a trivial reduction from SUBSET-SUM. Let count_perms(L1, L2, x) represent the function "count the number of permutations of L2 such that prodSum(L1, L2) < x".
def SUBSET_SUM(L2, n):
    """Determine whether any subset of L2 adds up to n, using count_perms as an oracle."""
    for i in range(1, len(L2) + 1):
        L1 = [0] * (len(L2) - i) + [1] * i
        # permutations with prodSum exactly equal to n
        if count_perms(L1, L2, n + 1) - count_perms(L1, L2, n) > 0:
            return True
    return False
Thus, if there were a way to calculate your function count_perms(L1, L2, x) efficiently, then we would have an efficient algorithm to calculate SUBSET_SUM(L2,n).
This also turns out to be an abstract algebra problem. It's been a while for me, but here are a few things to get started. There's nothing terribly significant about the following (it's all very basic; an expansion on the fact that every group is isomorphic to a permutation group), but it provides a different way of looking at the problem.
I'll try to stick to fairly standard notation: "x" is a vector, and "x_i" is the i-th component of x. If "L" is a list, L is the equivalent vector. "1_n" is a vector with all components = 1. The set of natural numbers ℕ is taken to be the positive integers. "[a,b]" is the set of integers from a through b, inclusive. "θ(x, y)" is the angle formed by x and y.
Note that prodSum is the dot product. The question is equivalent to finding all vectors L generated by an operation (permuting elements) on L2 such that θ(L1, L) is less than a given angle α. The operation is equivalent to reflecting a point in ℕ^n through a subspace with presentation:
< ℕ^n | (x_i x_j^-1) for (i,j) ∈ A >
where i and j are in [1,n], A has at least one element and no (i,i) is in A (i.e. A is a non-reflexive subset of [1,n]^2 with |A| > 0). Stated more plainly (and more ambiguously), the subspaces are the points where one or more components are equal to one or more other components. The reflections correspond to matrices whose columns are all the standard basis vectors.
Let's name the reflection group "RP_n" (it should have another name, but memory fails). RP_n is isomorphic to the symmetric group S_n. Thus
|RP_n| = |S_n| = n!
In 3 dimensions, this gives a group of order 6. The reflection group is D_3, the triangle symmetry group, as a subgroup of the cube symmetry group. It turns out you can also generate the points by rotating L2 in increments of π/3 around the line along 1_n. This is the cyclic group ℤ_6, and it points to a possible solution: find a group of order n! with a minimal number of generators and use that to generate the permutations of L2 as sequences with increasing, then decreasing, angle with L2. From there, we can try to generate the elements L with θ(L1, L) < α directly (for example we can binsearch on the 1st half of each sequence to find the transition point; with that, we can specify the rest of the sequence that fulfills the condition and count it in O(1) time). Let's call this group RP'_n.
RP'_4 is constructed of 4 subspaces isomorphic to ℤ_6. More generally, RP'_n is constructed of n subspaces isomorphic to RP'_(n-1).
This is where my abstract algebra muscles really begin to fail. I'll try to keep working on the construction, but Managu's answer doesn't leave much hope. I fear that reducing RP_3 to ℤ_6 is the only useful reduction we can make.
It looks like if l1 and l2 are both ordered high-to-low (or both low-to-high; whatever, as long as they have the same order), the result is maximized, and if they are ordered opposite ways, the result is minimized. Other alterations of the order appear to follow some rules: swapping two numbers in a list of consecutive integers always reduces the sum by a fixed amount, which seems to be related to how far apart they are (i.e. swapping 1 and 3 or 2 and 4 has the same effect). This was just from a little messing around, but the idea is that there is a maximum, a minimum, and if some pre-specified value lies between them, there are ways to count the permutations that make it possible (although if the list isn't evenly spaced, there aren't. Well, not that I know of. If l2 is (1 2 4 5), swapping 1 and 2 or 2 and 4 would have different effects).
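A quick brute-force Python check of those observations (arbitrary small values, purely illustrative):

from itertools import permutations

L1 = [1, 2, 3, 4]
sums = {p: sum(a * b for a, b in zip(L1, p)) for p in permutations(L1)}

print(max(sums.values()))                         # 30: both lists in the same order
print(min(sums.values()))                         # 20: lists in opposite order
print(sums[(3, 2, 1, 4)], sums[(1, 4, 3, 2)])     # 26 26: swapping 1,3 or 2,4 costs the same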