Random Sampling From Histogram/Frequency Hash Table

I am currently trying to come up with a semi-decent sampling algorithm, considering complexity, statistical properties, and common sense.
The data is currently contained inside a hash table, where each key is an item and the key's value is the item's frequency in the original distribution.
If one wanted to sample from such a histogram, how would one go about doing that while preserving the items' original probabilities in the sample?
Also, we require a flag indicating whether duplicate items are allowed in the sample. When duplicates are not allowed, the best I came up with is to apply the algorithm from the paragraph above and delete each item from the hash table once it is sampled. This way, at least the relative probabilities are preserved among the remaining items. However, I am unsure whether this is an accepted practice statistically.
Is there a generally accepted algorithm for doing this? If it helps, we need to implement it in Common Lisp.

Here is part of an answer; it uses lists instead of a hash table:
(defun random-item-with-prob (prob-item-pairs)
  "The argument PROB-ITEM-PAIRS is ((p_1 item_1) (p_2 item_2) ...
(p_n item_n)). The function returns one of the items according to the
probabilities."
  (loop with p = (random 1.0)   ; uniform draw from [0, 1)
        with x = 0              ; cumulative probability so far
        for pair in prob-item-pairs
        ;; return the first item whose cumulative slot contains P
        do (if (< p (+ (first pair) x))
               (return (second pair))
               (incf x (first pair)))))
For the second part of your question: sampling according to frequencies means that you care about the distribution of the data. Removing items (i.e., disallowing duplicates) alters that distribution as the sampling proceeds. If you really want to do that, you can call the previous function repeatedly, removing each sampled item, until you have the desired sample size.
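A minimal sketch of that remove-after-each-draw procedure (written in Clojure, as used elsewhere on this page; a Common Lisp translation is mechanical, and all names here are made up):

(defn weighted-draw
  "Draws one key from FREQS, a map of item -> frequency, with
  probability proportional to its frequency."
  [freqs]
  (let [total (reduce + (vals freqs))
        r     (rand total)]                 ; uniform in [0, total)
    (loop [[[item n] & more] (seq freqs)
           acc 0]
      (if (< r (+ acc n))
        item
        (recur more (+ acc n))))))

(defn sample-without-replacement
  "Draws up to K distinct items from FREQS; removing each drawn item
  implicitly renormalizes the probabilities of the remaining ones."
  [freqs k]
  (loop [freqs freqs, picked []]
    (if (or (= (count picked) k) (empty? freqs))
      picked
      (let [item (weighted-draw freqs)]
        (recur (dissoc freqs item) (conj picked item))))))

For example, (sample-without-replacement {:a 4 :b 1 :c 2} 2) returns two distinct keys, with :a the most likely first pick.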

Efficiently counting independent events in a given time range

Problem description
I came across this problem a couple of times and always wondered whether my solution was optimal or (far more probably) there is a better one.
Say that my component receives events, each composed of a time and a string. For every event I receive, I need to return how many independent strings were seen in the last x seconds (x is configurable but fixed at the beginning of the execution). By "last x seconds" I mean the time range that ends at the time of the event with the highest timestamp and lasts x seconds (both ends included).
Let me give an example. I receive the following events (represented by (time, string) pairs) in the given order, and for every event I show the expected return value, assuming x = 5.
(1, "a") → 1
(2, "b") → 2
(3, "a") → 2
(7, "c") → 3
(9, "c") → 1
(8, "d") → 2
We cannot assume the events arrive in the right order, but we can assume the discrepancies are small, i.e., if you put the events in an ordered list, you are in most cases adding the event at the end of the list or very close to it.
Also, the string is a simplification here. The items are actually objects that one can compare for equality and compute a hash of, but not order or do anything fancier with. (I will nevertheless call these objects "strings" in the rest of the question.)
My solution
I would use two data structures: a double-ended queue and a hash map. The former contains the events in order of time; the latter contains the strings seen in the last x seconds, along with a counter of how many times each has been seen.
For every event received, I would add it to the queue and increase its counter in the map. Then I would move to the beginning of the queue and remove from it all events whose timestamp is too old (i.e., lower than time_of_last_event - x); for each removed event I would decrement the corresponding counter in the map, and remove the entry from the map if its counter reaches zero. At the end, the size of the map is the number I have to return.
If out-of-order events occur often, but events are "almost in order", I could consider using a doubly linked list rather than a double-ended queue; when inserting an event, I would search backwards from the list's end to find the suitable place for it. This would save me from too many reallocations, but I'm not sure whether allocating memory for each inserted event would pay off in terms of performance.
Assuming constant-time insertion at the end of the queue, constant-time removal from its beginning, and constant-time operations in the hash map, each call to this algorithm is amortized constant-time (in the long run, I will remove as many entries as I insert, so on average one per call).
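A minimal Clojure sketch of this queue-plus-counter-map approach (illustrative names; it assumes events arrive in timestamp order, the simple case):

(def x 5) ; the configurable window length in seconds

(defn observe
  "STATE is {:queue <PersistentQueue of [t s]>, :counts {s n}}.
  Returns the updated state plus :distinct, the answer for this event."
  [{:keys [queue counts]} [t s]]
  (let [queue  (conj queue [t s])
        counts (update counts s (fnil inc 0))
        cutoff (- t x)
        [queue counts]
        (loop [q queue, c counts]
          (let [[t0 s0] (peek q)]
            (if (and t0 (< t0 cutoff))   ; evict events older than the window
              (recur (pop q)
                     (let [n (dec (c s0))]
                       (if (zero? n) (dissoc c s0) (assoc c s0 n))))
              [q c])))]
    {:queue queue :counts counts :distinct (count counts)}))

;; (reductions observe
;;             {:queue clojure.lang.PersistentQueue/EMPTY :counts {}}
;;             [[1 "a"] [2 "b"] [3 "a"] [7 "c"] [9 "c"]])
;; yields :distinct values 1, 2, 2, 3, 1, matching the example above.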
The main question is: Is there a better algorithm than the one described?
By better I mean an algorithm that would run faster or use less memory.
A few more questions
Is anything wrong with this algorithm?
Is this a well-known problem? Is there a name for it? I could not find anything, but I might have searched the wrong keywords.
Your solution is efficient enough if you receive the events in order, but if not, the algorithm becomes quadratic in the worst case: inserting a new element at its sorted position in the doubly linked list needs linear time in the size of the list. This may not be an issue if you regard x as a constant, because then the total time complexity is still O(n). If, however, you think of x as a variable, this is not optimal.
The solution is to use a Min Heap instead of a queue or doubly linked list:
Insert each new event into the heap, keyed by its timestamp. Whenever the root of the heap (the oldest entry) has a timestamp older than latest_timestamp - x, remove it, as it has fallen out of the window.
For the rest you can continue with the hash map as you were doing.
As an insertion into and a removal from the heap have O(log k) time complexity, where k is the number of events currently inside the window, the total time complexity is O(n log k).
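A sketch of this heap-based variant in Clojure via Java interop (java.util.PriorityQueue keyed by timestamp; names are illustrative). Unlike the in-order queue version, eviction here is driven by the largest timestamp seen so far, so mildly out-of-order arrivals are handled:

(import 'java.util.PriorityQueue)

(defn make-window [x]
  {:x x
   :heap   (PriorityQueue. 16 (fn [[t1 _] [t2 _]] (compare t1 t2)))
   :counts (atom {})
   :t-max  (atom Long/MIN_VALUE)})

(defn observe! [{:keys [x heap counts t-max]} [t s]]
  (.add heap [t s])
  (swap! counts update s (fnil inc 0))
  (swap! t-max max t)
  (loop []                                  ; pop roots outside the window
    (let [[t0 s0] (.peek heap)]
      (when (and t0 (< t0 (- @t-max x)))
        (.poll heap)
        (swap! counts
               (fn [c] (let [n (dec (c s0))]
                         (if (zero? n) (dissoc c s0) (assoc c s0 n)))))
        (recur))))
  (count @counts))

;; (let [w (make-window 5)]
;;   (mapv #(observe! w %) [[1 "a"] [2 "b"] [3 "a"] [7 "c"] [9 "c"] [8 "d"]]))
;; => [1 2 2 3 1 2], matching the example, including the late (8, "d").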

Clojure "same value" allocation

Say I have the following code
(let [a (some-function 3)
      b (some-function (+ 1 2))
      c a]
  (= a b))
Suppose some-function returns a very big data structure (say, a vector of one million elements).
First question: how much memory will Clojure allocate? Will it allocate memory for each of these three vectors, or will they share the same data structure?
Second question (closely related): how fast would it be to compare them? Will = iterate over each element or not?
This simplified example may look dumb, but there are similar real-life situations where this matters a lot, like
(map some-function [1 23 1 32 1 44 1 5 1 1 1 1])
EDIT
In my specific case, some-function returns a conj of two sets, which may be very big.
Whether a and b share the structure depends entirely on what some-function does. It's impossible to answer without any details. c and a will be bound to the same value, so no additional memory allocation for c.
In general, semantics and behavior of = depend on what values you compare. If you compare collections, then yes, = will iterate through each element until it either exhausts them, or finds the first pair of unequal ones.
Additionally:
= will first check if the objects being compared are identical (reference equality), so comparison of a and c will be instant.
While it's impossible to say for certain whether different some-function invocations return the same or different structures, chances are that without memoization they will be different. Just wrap your some-function using memoize to be certain there is no double allocation (of course, assuming your some-function is pure). As simple as
(def some-function (memoize some-function))
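A quick REPL-style illustration of both points (slow-f is a hypothetical function; the results follow from Clojure's equality semantics):

(def slow-f (memoize (fn [n] (vec (range n)))))

(identical? (slow-f 1000000) (slow-f 1000000)) ;=> true, one cached vector
(= (slow-f 1000000) (slow-f 1000000))          ;=> true via the reference
                                               ;   check, no element walk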

Algorithm for predicting most likely items from lists of data

Let's say I have N lists which are known. Each list has items, which may repeat (it is not a set).
eg:
{A,A,B,C}, {A,B,C}, {B,B,B,C,C}
I need some algorithm (Some machine-learning one maybe?) which answers the following question:
Given a new and unknown partial list of items, for example {A,B}, what is the probability that C will appear in the list, based on what I know from the previous lists? If possible, I would like a more fine-grained probability: given some partial list L, what is the probability that C appears in the list once, the probability it appears twice, etc.? Order doesn't matter: the probability of C appearing twice in {A,B} should equal that of it appearing twice in {B,A}.
Any algorithms which can do this?
This is just pure mathematics, no actual "algorithms"; simply estimate all the probabilities from your dataset (literally count the occurrences). In particular, you can use a very simple data structure to achieve your goal. Represent each "list" as a bag (multiset) of letters, thus:
{A,A,B,C} -> {A:2, B:1, C:1}
{A,B} -> {A:1, B:1}
etc., and create a basic reverse index of some sort; for example, keep an index for each letter separately, sorted by its counts.
Now, when a query comes, like {A,B} + C, all you do is search your data for lists that contain at least one A and one B (using your indexes), and then estimate the probability by computing the fraction of retrieved results containing C (or exactly one C) versus all retrieved results (this is a valid probability estimate, assuming your data is a bunch of independent samples from some underlying data-generating distribution).
Alternatively, if your alphabet is very small you can actually precompute all the values P(C|{A,B}) etc. for all combinations of letters.
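A minimal Clojure sketch of this counting scheme (illustrative names, and a linear scan in place of a real reverse index):

(def lists [[:A :A :B :C] [:A :B :C] [:B :B :B :C :C]])

(def bags (map frequencies lists)) ; {:A 2, :B 1, :C 1} etc.

(defn matches? [bag query]
  (every? (fn [[item n]] (>= (get bag item 0) n)) query))

(defn estimate
  "Estimates P(pred | query): the fraction of stored bags matching
  QUERY (e.g. {:A 1 :B 1}) that also satisfy PRED."
  [bags query pred]
  (let [hits (filter #(matches? % query) bags)]
    (when (seq hits)
      (/ (count (filter pred hits)) (count hits)))))

(estimate bags {:A 1 :B 1} #(pos? (get % :C 0))) ;=> 1 (C appears at all)
(estimate bags {:A 1 :B 1} #(= 2 (get % :C 0)))  ;=> 0 (C appears exactly twice)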

Clojure DAG (Bayesian Network)

I would like to build a Bayesian Network in Clojure, since I haven't found any similar project.
I have studied a lot of BN theory, but I still can't see how to implement the network (I am not what people call a "guru" at anything, and especially not at functional programming).
I do know that a BN is nothing more than a DAG and a lot of probability tables (one for each node), but now I have no clue how to implement the DAG.
My first idea was a huge set (the DAG) containing some little maps (the nodes of the DAG); every map should have a name (probably a :key), a probability table (another map?), a vector of parents, and a vector of non-descendants.
Now I don't know how to implement the references to the parents and non-descendants (what should I put in the two vectors?).
I guess that a pointer would be perfect, but Clojure lacks pointers; I could put the :name of the other node in the vector, but that is going to be slow, isn't it?
I was thinking that instead of vectors I could use more sets; this way it would be faster to find the descendants of a node.
There is a similar problem for the probability table, where I still need some reference to the other nodes.
Finally, I would also like to learn the BN (build the network starting from the data); this means that I will change the probability tables, edges, and nodes a lot.
Should I use mutable types, or would they only increase the complexity?
This is not a complete answer, but here is a possible encoding for the example network from the Wikipedia article. Each node has a name, a list of successors (children), and a probability table:
(defn node [name children table]
  {:name name :children children :table table})
Also, here are little helper functions for building true/false probabilities:
;; builds a closure returning P(X=true) or P(X=false)
(defn tf [true-prob] #(if % true-prob (- 1.0 true-prob)))
The above function returns a closure which, when given a true value (resp. a false value), returns the probability of the event X=true (resp. X=false), for the probability variable X we are encoding.
Since the network is a DAG, we can reference nodes directly from one another (exactly like the pointers you mentioned) without having to care about circular references. We just build the graph children-first:
(let [gw (node "grass wet" []
               (fn [& {:keys [sprinkler rain]}]
                 (tf (cond (and sprinkler rain) 0.99
                           sprinkler 0.9
                           rain 0.8
                           :else 0.0))))
      sk (node "sprinkler" [gw]
               (fn [& {:keys [rain]}] (tf (if rain 0.01 0.4))))
      rn (node "rain" [sk gw]
               (constantly (tf 0.2)))]
  (def dag
    {:nodes {:grass-wet gw :sprinkler sk :rain rn}
     :joint (fn [g s r]
              (* (((:table gw) :sprinkler s :rain r) g) ; P(G|S,R)
                 (((:table sk) :rain r) s)              ; P(S|R)
                 (((:table rn)) r)))}))                 ; P(R)
The probability table of each node is given as a function of the states of the parent nodes and returns the probability for true and false values. For example,
((:table (:grass-wet (:nodes dag))) :sprinkler true :rain false)
... returns a closure that yields 0.9 when called with true, and 0.09999999999999998 when called with false.
The resulting joint function combines probabilities according to this formula:
P(G,S,R) = P(G|S,R).P(S|R).P(R)
And ((:joint dag) true true true) returns 0.0019800000000000004.
Indeed, each value returned by ((:table <x>) <args>) is a closure around an if, which returns a probability given the state of the probability variable. We call each closure with the respective true/false value to extract the appropriate probability, and multiply them.
Here, I am cheating a little because I suppose that the joint function should instead be computed by traversing the graph (a macro could help, in the general case). This also feels a little messy, notably regarding node states, which are not necessarily only true and false: you would most likely use a map in the general case.
In general, the way to compute the joint distribution of a BN is
prod( P(node | parents of node) )
To achieve this, you need a list of nodes where each node contains
node name
list of parents
probability table
list of children
The probability table is perhaps easiest to handle when flat, with each row corresponding to a configuration of the parents and each column corresponding to a value of the node. This assumes you are using a record to hold all of the values. The node's own value can also be stored within the node.
Nodes with no parents have only one row.
Each row should be normalized, after which P(node|parents) = table[row, col].
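For instance, one possible flat encoding in Clojure (a hypothetical shape, with rows keyed by the parent configuration):

(def sprinkler-cpt
  {:parents [:rain]
   :rows {[true]  {true 0.01 false 0.99}    ; one row per parent config
          [false] {true 0.4  false 0.6}}})  ; each row sums to 1

(defn p [cpt parent-vals value]
  (get-in cpt [:rows parent-vals value]))

(p sprinkler-cpt [true] true) ;=> 0.01, i.e. P(sprinkler=true | rain=true)

A node with no parents simply has a single row keyed by the empty vector [].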
You don't really need the list of children, but having it could make topological sorting easier. (A DAG can always be topologically sorted.)
The biggest problem arises because the number of cells in the probability table is the product of the dimensions of the parents and of the node itself. I handled this in C++ using a sparse table with row mapping.
Querying the DAG is a different matter, and the best method for doing this depends on size and on whether an approximate answer is sufficient. There isn't enough room to cover the methods here. Searching for Murphy and the Bayes Net Toolbox might be helpful.
I realize you are specifically looking for an implementation but, with a little work, you can roll your own.
You may try to go even flatter and have several maps indexed by node ids: one map for probability tables, one for parents, and one for non-descendants (I'm no BN expert: what is this, and how is it used? It feels like something that could be recomputed from the parents map).

Genetic algorithms: How to do crossover in "subset" problems?

I have a problem which I am trying to solve with genetic algorithms. The problem is selecting some subset (say 4) of 100 integers (these integers are just ids that represent something else). Order does not matter, the solution to the problem is a SET of integers not an ordered list. I have a good fitness function but am having trouble with the crossover function.
I want to be able to mate the following two chromosomes:
[1 2 3 4] and
[3 4 5 6] into something useful. Clearly I cannot use the typical crossover function, because I could end up with duplicates in my children, which would represent invalid solutions. What is the best crossover method in this case?
Just ignore any element that occurs in both of the sets (i.e., in their intersection); that is, leave such elements unchanged in both sets.
The rest of the elements form two disjoint sets, to which you can apply pretty much any random transformation (e.g. swapping some pairs randomly) without getting duplicates.
This can be thought of as ordering and aligning both sets so that matching elements face each other and applying one of the standard crossover algorithms.
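A sketch of this intersection-preserving crossover in Clojure (set sizes assumed equal, as in the example):

(require '[clojure.set :as set])

(defn crossover [a b]
  (let [shared (set/intersection a b)
        pool   (shuffle (vec (set/difference (set/union a b) shared)))
        k      (- (count a) (count shared))
        [xs ys] (split-at k pool)]
    [(into shared xs) (into shared ys)]))

;; (crossover #{1 2 3 4} #{3 4 5 6})
;; => e.g. [#{1 3 4 6} #{2 3 4 5}] — the shared 3 and 4 are kept in both
;;    children, the disjoint rest {1 2 5 6} is redealt at random, and no
;;    duplicates are possible since the pool elements are distinct.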
Sometimes it is beneficial to let your solution go "out of bounds" so that your search will converge more quickly. Rather than making a set of 4 unique integers a requirement for your chromosome, make the number of integers (and their uniqueness) part of the fitness function.
Since order doesn't matter, just collect all the numbers into an array, sort the array, throw out the duplicates (by disconnecting them from a linked list, or setting them to a negative number, or whatever). Shuffle the array and take the first 4 numbers.
I don't really know what you mean by "typical crossover", but I think you could use a crossover similar to what is often used for permutations:
take m ints from the first parent (m < n, where n is the number of ints in your sets);
scan the second parent and fill your subset from it with the (n-m) ints that are free (not already in the subset).
This way you will have m ints from the first and n-m ints from the second parent, without duplications.
Sounds like a valid crossover for me :-).
I guess it might be beneficial not to do either step on ordered sets (or using an iterator whose order of returned elements correlates with the natural ordering of the ints); otherwise, either smaller or larger numbers would get a higher chance to end up in the child, making your search biased.
Whether it is the best method depends on the problem you want to solve...
In order to combine sets A and B, you could choose the resulting set S probabilistically, so that the probability that x is in S is (number of sets among A and B that contain x) / 2. S is then guaranteed to contain the intersection, to be contained in the union, and to have expected cardinality 4.
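This variant is easy to express directly; a sketch, keeping shared elements with probability 1 and exclusive ones with probability 1/2:

(require '[clojure.set :as set])

(defn prob-crossover [a b]
  (into (set/intersection a b)                 ; P = 2/2 for shared elements
        (filter (fn [_] (< (rand) 0.5))        ; P = 1/2 for exclusive ones
                (set/union (set/difference a b)
                           (set/difference b a)))))

;; (prob-crossover #{1 2 3 4} #{3 4 5 6}) => e.g. #{1 3 4 5}
;; The child always contains {3 4}, is contained in {1 2 3 4 5 6},
;; and has expected size 4.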
