I have 130,000 string records. I want to find the number of occurrences of each string element in those 130,000 records by doing a regex search for each record instead of a simple search.
Currently I am doing it with a kind of simple linear search, which is taking hours.
What are some other approaches that can reduce this time?
The method I followed is:
I took a string element from the 130,000 records, did a regex search (linear), and then counted the matches for that picked string element.
And I did the same for each element.
So there are 130,000 iterations for each string element.
Total iterations will be
130,000 * 130,000 = 16,900,000,000, a huge number.
How can I reduce these iterations?
I'm not sure I understand the question completely. I'm confused about what the records are and what you mean by finding occurrences in the records.
If the question is simply how many times each record appears among the 130,000 records, and you do the simplest approach, maybe something like
records = [
"foo", "bar", "foobar", "baz", "bar", "qux"
]
for rec in records:
count = sum(x == rec for x in records)
print(count)
it takes O(n²m) if n is the number of records and m the average length of the strings.
If you put the strings in a (hash) set, you can update it in time O(m) (to hash and compare) so something like
from collections import Counter
print(Counter(records))
only takes O(nm).
I suspect that the problem is harder than this, though.
Is the problem to find how often one record occurs as a substring of another?
That you can do with a (generalised) suffix tree or suffix array. Concatenate your records "foo[1]bar[2]foobar[3]baz[4]bar[5]qux[6]" with sentinel characters [i] between them. You need 130,000 sentinels, which will be a slight problem, but if you remap the alphabet to put the real letters in the range 0..K you can use the integers from K upwards as sentinels. You will then want a construction algorithm that is independent of the alphabet size.
Anyway, each record will be represented as a leaf that ends in a sentinel. That record occurs more than once if the edge going to that leaf only has the sentinel on it, and the number of occurrences is the size of the sub-tree that is rooted at the leaf's parent.
It takes O(nm) to build the suffix tree--this is optimal as that is the size of your data. If you traverse the tree and annotate each node with the size of the sub-tree, that will take time O(nm) as well. Then, in another O(nm) traversal you can identify all the relevant leaves and report the number of occurrences. All in all O(nm) but now reporting how often each record appears as a substring of another.
Update
I had a little time between meetings to write a rough mock-up of the idea. The complete code is below, and it uses a library I use for teaching, pystr, to build the suffix tree. This isn't production code, so it probably isn't fast enough, but it should give a hint of the idea.
Anyway, I would take all the records and make a single string out of them, so I can build a suffix tree. I took records such as "foo", "bar", "foobar" and translated it into the string [1]foo[2]bar[3]foobar where [i] is the integer i.
BookKeeping = namedtuple("BookKeeping", "k x recs")
def remap(recs: list[str]) -> BookKeeping:
"""
Map records to a single string.
    The strings are mapped to an alphabet that
starts at the integer k+1, where k is the number
of records. Each record i has the unique letter i
in front of it. I count records from 1 since 0
is used as a sentinel in the ST construction.
With this construction, a leaf with index i is a
record if x[i-1] == <i> and we always have a branch
just after a record.
"""
k = len(recs)
letters = set.union(*(set(rec) for rec in recs))
alphabet = {a: i+k+1 for i, a in enumerate(letters)}
# The recid is a hack to map indices to records. You can
# do better than this.
x: list[int] = []
recid: list[Optional[int]] = []
for i, rec in enumerate(recs):
x.append(i+1)
recid.append(None)
x.extend(alphabet[a] for a in rec)
recid.extend([i+1] * len(rec))
return BookKeeping(k, x, recid)
I start at 1 because my library uses 0 as its terminal in the suffix tree, so that will not be a unique letter. All the others will be, and that is what I then exploit. If I have the record i, the string is [i]...[i+1] (or [i]...[0] for the final record), and in the suffix tree I am interested in the string ...[i+1]. Since [i+1] is a unique letter, there will always be a branch there, going down to a leaf. So I can check if I am looking at a leaf that represents a full record with
def is_rec_index(bk: BookKeeping, n: Leaf) -> bool:
"""
Check if n's index is into a record.
It is a record index if it is an index into one of the
original records, and it is that if it isn't any of
the record sentinel letters or the terminal sentinel.
"""
return 0 < n.leaf_label < len(bk.x) and \
bk.x[n.leaf_label] > bk.k
def is_record(bk: BookKeeping, n: Leaf) -> bool:
"""
Check if index i is representing a record.
We have a record if we start at an index just to
the right of a record letter, i.e., x[i-1] <= k,
and we are sitting on an edge that starts with
the following record letter.
"""
if not is_rec_index(bk, n):
return False
rec = bk.x[n.leaf_label - 1]
if rec > bk.k:
return False
# We started at the first letter in a record,
# so check if we branched at the next rec letter
return n.edge_label[0] == rec + 1
The rest of the algorithm is just annotating the suffix tree with what information I need and then reporting for each record leaf. The parent of a record leaf is the subtree that starts with the record, and all records that contain it will be in this subtree.
I've implemented two algorithms. One is fast, and counts how many times a record appears outside itself, and the other is slow but reports which records it appears in. This is the annotation function:
def annotate(bk: BookKeeping, n: Node, depth: int = 0) -> None:
"""
Annotates the tree's nodes with the info we need to
collect the record information.
"""
# FIXME: You should probably use an explicit stack here.
# You can easily run out of recursion stack.
n.depth = depth
if isinstance(n, Leaf):
n.is_record = is_record(bk, n)
n.recs = get_record(bk, n) # collect the records represented
n.size = int(is_rec_index(bk, n)) # a real index counts as one
if isinstance(n, Inner):
# Recursively annotate
for child in n.children.values():
annotate(bk, child, depth + len(child.edge_label))
# Compute the size. This will in total run in time the
# size of the suffix tree, which is optimal
n.size = 0
for child in n.children.values():
n.size += child.size
# FIXME: You need something smarter here! This can take
# time O(n) for each of O(nm) inner nodes, where n is the
# number of records. A bit-vector can cut this to O(n/w) where
# w is the word size, but it is still O(n²m/w) and n is large.
n.recs = set.union(*(child.recs for child in n.children.values()))
If I only need to know how many leaves each node has, I can do it in constant time per node, but to collect the records that appear in subtrees I have to merge sets. You should be able to fix this if you do the collection and annotation in the same traversal, so you can drag a set along with you. If you add to the set as you traverse the tree, collect what you need, and then move up the recursion, I don't think you have to merge sets. But I don't have time to implement this right now.
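If you still want per-node record sets, one standard alternative to dragging a single set through the recursion is "small-to-large" merging: always pour the smaller child sets into the largest one, so each element is moved only O(log n) times in total. This is just a rough, generic sketch of that idea (the helper name is mine and it is not tied to pystr's node classes):
def merge_child_sets(child_sets: list[set[int]]) -> set[int]:
    """Union child sets by reusing the largest one as the accumulator."""
    if not child_sets:
        return set()
    child_sets.sort(key=len, reverse=True)
    acc = child_sets[0]  # the largest set becomes the accumulator
    for s in child_sets[1:]:
        acc |= s         # pour the smaller sets into it
    return acc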
Anyway, after the node annotation you can count or collect like this:
def collect(bk: BookKeeping,
n: Node,
res: None | dict[int, list[int]] = None
) -> dict[int, list[int]]:
"""Collect which records any given record appear as a sub-string in."""
res = res if res is not None else {}
# FIXME: You should probably use an explicit stack here.
# You can easily run out of recursion stack.
if isinstance(n, Leaf) and n.is_record:
res[bk.x[n.leaf_label-1]] = n.parent.recs
if isinstance(n, Inner):
for child in n.children.values():
collect(bk, child, res)
    return res  # we return this just for convenience
def count(bk: BookKeeping,
n: Node,
res: None | dict[int, int] = None
) -> dict[int, int]:
"""Count how many records contain each rec as a substring."""
res = res if res is not None else {}
# FIXME: You should probably use an explicit stack here.
# You can easily run out of recursion stack.
if isinstance(n, Leaf) and n.is_record:
res[bk.x[n.leaf_label-1]] = n.parent.size - 1 # don't count itself
if isinstance(n, Inner):
for child in n.children.values():
count(bk, child, res)
    return res  # we return this just for convenience
The complete code is below. I hope it makes sense; otherwise, feel free to ask.
"""Computing record summaries."""
from typing import Optional
from collections import namedtuple
from pystr.suffixtree import (
SuffixTree, Node, Inner, Leaf,
mccreight_st_construction
)
BookKeeping = namedtuple("BookKeeping", "k x recs")
def remap(recs: list[str]) -> BookKeeping:
"""
Map records to a single string.
    The strings are mapped to an alphabet that
starts at the integer k+1, where k is the number
of records. Each record i has the unique letter i
in front of it. I count records from 1 since 0
is used as a sentinel in the ST construction.
With this construction, a leaf with index i is a
record if x[i-1] == <i> and we always have a branch
just after a record.
"""
k = len(recs)
letters = set.union(*(set(rec) for rec in recs))
alphabet = {a: i+k+1 for i, a in enumerate(letters)}
# The recid is a hack to map indices to records. You can
# do better than this.
x: list[int] = []
recid: list[Optional[int]] = []
for i, rec in enumerate(recs):
x.append(i+1)
recid.append(None)
x.extend(alphabet[a] for a in rec)
recid.extend([i+1] * len(rec))
return BookKeeping(k, x, recid)
def build_st(x: list[int]) -> SuffixTree:
"""Build a suffix tree."""
    # McCreight's construction doesn't handle very large
    # alphabets well... there are better algorithms, I just
    # don't have an implementation of one, so I use it anyway.
return mccreight_st_construction(x)
def is_rec_index(bk: BookKeeping, n: Leaf) -> bool:
"""
Check if n's index is into a record.
It is a record index if it is an index into one of the
original records, and it is that if it isn't any of
the record sentinel letters or the terminal sentinel.
"""
return 0 < n.leaf_label < len(bk.x) and \
bk.x[n.leaf_label] > bk.k
def is_record(bk: BookKeeping, n: Leaf) -> bool:
"""
Check if index i is representing a record.
We have a record if we start at an index just to
the right of a record letter, i.e., x[i-1] <= k,
and we are sitting on an edge that starts with
the following record letter.
"""
if not is_rec_index(bk, n):
return False
rec = bk.x[n.leaf_label - 1]
if rec > bk.k:
return False
# We started at the first letter in a record,
# so check if we branched at the next rec letter
return n.edge_label[0] == rec + 1
def get_record(bk: BookKeeping, n: Leaf) -> set[int]:
"""Get a set (singleton or empty) of the records in n."""
if n.leaf_label < len(bk.x) and bk.recs[n.leaf_label] is not None:
return {bk.recs[n.leaf_label]}
return set()
def annotate(bk: BookKeeping, n: Node, depth: int = 0) -> None:
"""
Annotates the tree's nodes with the info we need to
collect the record information.
"""
# FIXME: You should probably use an explicit stack here.
# You can easily run out of recursion stack.
n.depth = depth
if isinstance(n, Leaf):
n.is_record = is_record(bk, n)
n.recs = get_record(bk, n) # collect the records represented
n.size = int(is_rec_index(bk, n)) # a real index counts as one
if isinstance(n, Inner):
# Recursively annotate
for child in n.children.values():
annotate(bk, child, depth + len(child.edge_label))
# Compute the size. This will in total run in time the
# size of the suffix tree, which is optimal
n.size = 0
for child in n.children.values():
n.size += child.size
# FIXME: You need something smarter here! This can take
# time O(n) for each of O(nm) inner nodes, where n is the
# number of records. A bit-vector can cut this to O(n/w) where
# w is the word size, but it is still O(n²m/w) and n is large.
n.recs = set.union(*(child.recs for child in n.children.values()))
def collect(bk: BookKeeping,
n: Node,
res: None | dict[int, list[int]] = None
) -> dict[int, list[int]]:
"""Collect which records any given record appear as a sub-string in."""
res = res if res is not None else {}
# FIXME: You should probably use an explicit stack here.
# You can easily run out of recursion stack.
if isinstance(n, Leaf) and n.is_record:
res[bk.x[n.leaf_label-1]] = n.parent.recs
if isinstance(n, Inner):
for child in n.children.values():
collect(bk, child, res)
    return res  # we return this just for convenience
def count(bk: BookKeeping,
n: Node,
res: None | dict[int, int] = None
) -> dict[int, int]:
"""Count how many records contain each rec as a substring."""
res = res if res is not None else {}
# FIXME: You should probably use an explicit stack here.
# You can easily run out of recursion stack.
if isinstance(n, Leaf) and n.is_record:
res[bk.x[n.leaf_label-1]] = n.parent.size - 1 # don't count itself
if isinstance(n, Inner):
for child in n.children.values():
count(bk, child, res)
    return res  # we return this just for convenience
def compute_rec_overlap(recs: list[str]) -> tuple[dict[int, set[int]], dict[int, int]]:
"""Compute the overlap summary for recs."""
bk = remap(recs)
st = build_st(bk.x)
annotate(bk, st.root)
summary = collect(bk, st.root)
counts = count(bk, st.root)
return summary, counts
recs = ["foo", "bar", "baz", "foobar", "fooqux", "foobarbaz"]
overlaps, counts = compute_rec_overlap(recs)
for rec, containing in overlaps.items():
    print(rec, containing)
for rec, cnt in counts.items():
    print(rec, cnt)
Related
Given a string sequence which contains only four letters, ['a','g','c','t']
for example: agggcttttaaaatttaatttgggccc.
Find all the shortest unique sub-strings of the string sequence, which are of equal length (the length should be the minimum over all unique sub-strings).
For example : aaggcgccttt
answer: ['aa', 'ag', 'gg','cg', 'cc','ct']
explanation: shortest unique sub-strings of length 2
I have tried using suffix arrays coupled with longest common prefix, but I am unable to work out the solution.
I'm not sure what you mean by "minimum unique sub-string", but looking at your example I assume you mean "shortest runs of a single letter". If this is the case, you just need to iterate through the string once (character by character) and count all the shortest runs you find. You should keep track of the length of the minimum run found so far (infinity at start) and the length of the current run.
If you need to find the exact runs, you can add all the minimum runs you find to e.g. a list as you iterate through the string (and modify that list accordingly if a shorter run is found).
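For that run-based reading of the problem, a minimal Python sketch of the single pass described above might look like this (the function name is just for illustration):
def shortest_runs(s):
    best_len = float('inf')  # length of the shortest run found so far
    best_runs = []           # the runs of that length
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1           # extend the current run of s[i]
        run = s[i:j]
        if len(run) < best_len:
            best_len, best_runs = len(run), [run]
        elif len(run) == best_len:
            best_runs.append(run)
        i = j
    return best_runs

print(shortest_runs("aaggcgccttt"))  # ['c', 'g'] -> the shortest runs have length 1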
EDIT:
I thought more about the problem and came up with the following solution.
We find all the unique sub-strings of length i (in ascending order). So, first we consider all sub-strings of length 1, then all sub-strings of length 2, and so on. If we find any, we stop, since the sub-string length can only increase from this point.
You will have to use a list to keep track of the sub-strings you've seen so far, and a list to store the actual sub-strings. You will also have to maintain them accordingly as you find new sub-strings.
Here's the Java code I came up with, in case you need it:
String str = "aaggcgccttt";
String curr = "";
ArrayList<String> uniqueStrings = new ArrayList<String>();
ArrayList<String> alreadySeen = new ArrayList<String>();
for (int i = 1; i < str.length(); i++) {
for (int j = 0; j < str.length() - i + 1; j++) {
curr = str.substring(j, j + i);
if (!alreadySeen.contains(curr)){ //Sub-string hasn't been seen yet
uniqueStrings.add(curr);
alreadySeen.add(curr);
}
else //Repeated sub-string found
uniqueStrings.remove(curr);
}
if (!uniqueStrings.isEmpty()) //We have found non-repeating sub-string(s)
break;
alreadySeen.clear();
}
//Output
if (uniqueStrings.isEmpty())
System.out.println(str);
else {
for (String s : uniqueStrings)
System.out.println(s);
}
The uniqueStrings list contains all the unique sub-strings of minimum length (used for output). The alreadySeen list keeps track of all the sub-strings that have already been seen (used to exclude repeating sub-strings).
I'll write some code in Python, because that's what I find the easiest.
I actually wrote both the overlapping and the non-overlapping variants. As a bonus, it also checks that the input is valid.
You seem to be interested only in the overlapping variant:
import itertools
def find_all(
text,
pattern,
overlap=False):
"""
    Find all occurrences of the pattern in the text.
Args:
text (str|bytes|bytearray): The input text.
pattern (str|bytes|bytearray): The pattern to find.
overlap (bool): Detect overlapping patterns.
Yields:
position (int): The position of the next finding.
"""
len_text = len(text)
offset = 1 if overlap else (len(pattern) or 1)
i = 0
while i < len_text:
i = text.find(pattern, i)
if i >= 0:
yield i
i += offset
else:
break
def is_valid(text, tokens):
"""
Check if the text only contains the specified tokens.
Args:
text (str|bytes|bytearray): The input text.
tokens (str|bytes|bytearray): The valid tokens for the text.
Returns:
result (bool): The result of the check.
"""
return set(text).issubset(set(tokens))
def shortest_unique_substr(
text,
tokens='acgt',
overlapping=True,
check_valid_input=True):
"""
Find the shortest unique substring.
Args:
text (str|bytes|bytearray): The input text.
tokens (str|bytes|bytearray): The valid tokens for the text.
overlap (bool)
check_valid_input (bool): Check if the input is valid.
Returns:
result (set): The set of the shortest unique substrings.
"""
def add_if_single_match(
text,
pattern,
result,
overlapping):
match_gen = find_all(text, pattern, overlapping)
try:
next(match_gen) # first match
except StopIteration:
# the pattern is not found, nothing to do
pass
else:
try:
next(match_gen)
except StopIteration:
# the pattern was found only once so add to results
result.add(pattern)
else:
# the pattern is found twice, nothing to do
pass
# just some sanity check
if check_valid_input and not is_valid(text, tokens):
raise ValueError('Input text contains invalid tokens.')
result = set()
# shortest sequence cannot be longer than this
if overlapping:
        max_lim = len(text) // 2 + 1
        for n in range(1, max_lim + 1):
            for pattern_gen in itertools.product(tokens, repeat=n):
pattern = ''.join(pattern_gen)
add_if_single_match(text, pattern, result, overlapping)
if len(result) > 0:
break
else:
max_lim = len(tokens)
for n in range(1, max_lim + 1):
for i in range(len(text) - n):
pattern = text[i:i + n]
add_if_single_match(text, pattern, result, overlapping)
if len(result) > 0:
break
return result
After some sanity checks for the correctness of the outputs:
import functools

shortest_unique_substr_ovl = functools.partial(shortest_unique_substr, overlapping=True)
shortest_unique_substr_ovl.__name__ = 'shortest_unique_substr_ovl'
shortest_unique_substr_not = functools.partial(shortest_unique_substr, overlapping=False)
shortest_unique_substr_not.__name__ = 'shortest_unique_substr_not'
funcs = shortest_unique_substr_ovl, shortest_unique_substr_not
test_inputs = (
'aaa',
'aaaa',
'aaggcgccttt',
'agggcttttaaaatttaatttgggccc',
)
for func in funcs:
print('Func:', func.__name__)
for test_input in test_inputs:
print(func(test_input))
print()
Func: shortest_unique_substr_ovl
set()
set()
{'cg', 'ag', 'gg', 'ct', 'aa', 'cc'}
{'tg', 'ag', 'ct'}
Func: shortest_unique_substr_not
{'aa'}
{'aaa'}
{'cg', 'tt', 'ag', 'gg', 'ct', 'aa', 'cc'}
{'tg', 'ag', 'ct', 'cc'}
It is wise to benchmark how fast we actually are.
Below you can find some benchmarks, produced using some template code from here (the overlapping variant is in blue):
and the rest of the code for completeness:
def gen_input(n, tokens='acgt'):
return ''.join([tokens[random.randint(0, len(tokens) - 1)] for _ in range(n)])
def equal_output(a, b):
return a == b
input_sizes = tuple(2 ** (1 + i) for i in range(16))
runtimes, input_sizes, labels, results = benchmark(
funcs, gen_input=gen_input, equal_output=equal_output,
input_sizes=input_sizes)
plot_benchmarks(runtimes, input_sizes, labels, units='ms')
plot_benchmarks(runtimes, input_sizes, labels, units='μs', zoom_fastest=2)
As far as the asymptotic time-complexity analysis is concerned, considering only the overlapping case, let N be the input size, let K be the number of tokens (4 in your case), find_all() is O(N), and the body of shortest_unique_substr is O(K²) (+ O((K - 1)²) + O((K - 2)²) + ...).
So, this is overall O(N*K²) or O(N*(Σk²)) (for k = 1, …, K), since K is fixed, this is O(N), as the benchmarks seem to indicate.
I have a tree as input to the breadth first search and I want to know as the algorithm progresses at which level it is?
# Breadth First Search Implementation
graph = {
'A':['B','C','D'],
'B':['A'],
'C':['A','E','F'],
'D':['A','G','H'],
'E':['C'],
'F':['C'],
'G':['D'],
'H':['D']
}
def breadth_first_search(graph,source):
"""
This function is the Implementation of the breadth_first_search program
"""
# Mark each node as not visited
mark = {}
for item in graph.keys():
mark[item] = 0
queue, output = [],[]
# Initialize an empty queue with the source node and mark it as explored
queue.append(source)
mark[source] = 1
output.append(source)
# while queue is not empty
while queue:
# remove the first element of the queue and call it vertex
vertex = queue[0]
queue.pop(0)
# for each edge from the vertex do the following
for vrtx in graph[vertex]:
# If the vertex is unexplored
if mark[vrtx] == 0:
                queue.append(vrtx)  # append it to the queue
                mark[vrtx] = 1      # and mark it as explored
output.append(vrtx) # fill the output vector
return output
print breadth_first_search(graph, 'A')
It takes a tree as the input graph. What I want is that at each iteration it should print out the current level being processed.
Actually, we don't need an extra queue to store the info on the current depth, nor do we need to add null to tell whether it's the end of the current level. We just need to know how many nodes the current level contains; then we can deal with all the nodes in the same level, and increase the level by 1 after we are done processing all the nodes on the current level.
int level = 0;
Queue<Node> queue = new LinkedList<>();
queue.add(root);
while(!queue.isEmpty()){
int level_size = queue.size();
while (level_size-- != 0) {
Node temp = queue.poll();
if (temp.right != null) queue.add(temp.right);
if (temp.left != null) queue.add(temp.left);
}
level++;
}
You don't need to use an extra queue or do any complicated calculation to achieve what you want to do. The idea is very simple.
This does not use any extra space other than the queue used for BFS.
The idea is to add null at the end of each level. So the number of nulls you have encountered plus 1 is the depth you are at (of course, after termination it is just the level).
int level = 0;
Queue <Node> queue = new LinkedList<>();
queue.add(root);
queue.add(null);
while(!queue.isEmpty()){
Node temp = queue.poll();
if(temp == null){
level++;
queue.add(null);
if(queue.peek() == null) break;// You are encountering two consecutive `nulls` means, you visited all the nodes.
else continue;
}
if(temp.right != null)
queue.add(temp.right);
if(temp.left != null)
queue.add(temp.left);
}
Maintain a queue storing the depth of the corresponding node in BFS queue. Sample code for your information:
queue bfsQueue, depthQueue;
bfsQueue.push(firstNode);
depthQueue.push(0);
while (!bfsQueue.empty()) {
f = bfsQueue.front();
depth = depthQueue.front();
bfsQueue.pop(), depthQueue.pop();
for (every node adjacent to f) {
bfsQueue.push(node), depthQueue.push(depth+1);
}
}
This method is simple and naive; for O(1) extra space you may want the answer posted by @stolen_leaves.
Try having a look at this post. It keeps track of the depth using the variable currentDepth
https://stackoverflow.com/a/16923440/3114945
For your implementation, keep track of the left most node and a variable for the depth. Whenever the left most node is popped from the queue, you know you hit a new level and you increment the depth.
So, your root is the leftMostNode at level 0. Then the left most child is the leftMostNode. As soon as you hit it, it becomes level 1. The left most child of this node is the next leftMostNode and so on.
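A rough Python sketch of that leftmost-node bookkeeping (my own illustration, assuming nodes with left, right and val attributes, not code from the linked post):
from collections import deque

def bfs_with_depth(root):
    if root is None:
        return
    queue = deque([root])
    left_most = root   # leftmost node of the level we are about to enter
    depth = -1         # becomes 0 when the root is popped
    while queue:
        node = queue.popleft()
        if node is left_most:
            depth += 1        # we just hit a new level
            left_most = None  # the next enqueued node starts the following level
        print(depth, node.val)
        for child in (node.left, node.right):
            if child is not None:
                if left_most is None:
                    left_most = child
                queue.append(child)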
With this Python code you can maintain the depth of each node from the root by increasing the depth only after you encounter a node of new depth in the queue.
from collections import deque

queue = deque()
marked = set()
marked.add(root)
queue.append((root,0))
depth = 0
while queue:
r,d = queue.popleft()
if d > depth: # increase depth only when you encounter the first node in the next depth
depth += 1
for node in edges[r]:
if node not in marked:
marked.add(node)
queue.append((node,depth+1))
If your tree is perfectly balanced (i.e. each node has the same number of children) there's actually a simple, elegant solution here with O(1) time complexity and O(1) space complexity. The main use case where I find this helpful is in traversing a binary tree, though it's trivially adaptable to other branching factors.
The key thing to realize here is that each level of a binary tree contains exactly double the quantity of nodes compared to the previous level. This allows us to calculate the total number of nodes in any tree given the tree's depth. For instance, consider a perfect binary tree with a depth of 3: it has 7 total nodes. We don't need to count the number of nodes to figure this out, though. We can compute this in O(1) time with the formula 2^d - 1 = N, where d is the depth and N is the total number of nodes. (In a ternary tree this is (3^d - 1)/2 = N, and in a tree where each node has K children it is (K^d - 1)/(K - 1) = N.) So in this case, 2^3 - 1 = 7.
To keep track of depth while conducting a breadth first search, we simply need to reverse this calculation. Whereas the above formula allows us to solve for N given d, we actually want to solve for d given N. For instance, say we're evaluating the 5th node. To figure out what depth the 5th node is on, we take the equation 2^d - 1 = 5 and solve for d, which is basic algebra: d = log2(5 + 1) ≈ 2.58.
If d turns out to be anything other than a whole number, just round up (the last node in a row always yields a whole number); here d rounds up to 3, so the 5th node is on the third level. With all that in mind, I propose the following algorithm to identify the depth of any given node in a binary tree during breadth first traversal:
Let the variable visited equal 0.
Each time a node is visited, increment visited by 1.
Each time visited is incremented, calculate the node's depth as depth = round_up(log2(visited + 1))
You can also use a hash table to map each node to its depth level, though this does increase the space complexity to O(n). Here's a PHP implementation of this algorithm:
<?php
$tree = [
['A', [1,2]],
['B', [3,4]],
['C', [5,6]],
['D', [7,8]],
['E', [9,10]],
['F', [11,12]],
['G', [13,14]],
['H', []],
['I', []],
['J', []],
['K', []],
['L', []],
['M', []],
['N', []],
['O', []],
];
function bfs($tree) {
$queue = new SplQueue();
$queue->enqueue($tree[0]);
$visited = 0;
$depth = 0;
$result = [];
while ($queue->count()) {
$visited++;
$node = $queue->dequeue();
$depth = ceil(log($visited+1, 2));
$result[$depth][] = $node[0];
if (!empty($node[1])) {
foreach ($node[1] as $child) {
$queue->enqueue($tree[$child]);
}
}
}
print_r($result);
}
bfs($tree);
Which prints:
Array
(
[1] => Array
(
[0] => A
)
[2] => Array
(
[0] => B
[1] => C
)
[3] => Array
(
[0] => D
[1] => E
[2] => F
[3] => G
)
[4] => Array
(
[0] => H
[1] => I
[2] => J
[3] => K
[4] => L
[5] => M
[6] => N
[7] => O
)
)
Set a variable cnt and initialize it to the size of the queue, cnt = queue.size(). Now decrement cnt each time you pop. When cnt reaches 0, increase the depth of your BFS and set cnt = queue.size() again.
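Applied to the adjacency dict graph from the question above, a minimal Python sketch of this counter idea (the function name is mine) could look like:
from collections import deque

def bfs_levels(graph, source):
    queue = deque([source])
    seen = {source}
    depth = 0
    while queue:
        print("level", depth, list(queue))
        cnt = len(queue)              # cnt = queue.size()
        while cnt:
            vertex = queue.popleft()
            cnt -= 1                  # decrement cnt on every pop
            for neighbour in graph[vertex]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        depth += 1                    # cnt hit 0: one whole level processed

bfs_levels(graph, 'A')  # level 0: A; level 1: B C D; level 2: E F G H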
In Java it would be something like this.
The idea is to look at the parent to decide the depth.
//Maintain depth for every node based on its parent's depth
Map<Character,Integer> depthMap=new HashMap<>();
queue.add('A');
depthMap.put('A', 0); //this is where you start your search
while(!queue.isEmpty())
{
Character parent=queue.remove();
List<Character> children=adjList.get(parent);
for(Character child :children)
{
if (child.isVisited() == false) {
            child.visit(parent);
            depthMap.put(child, depthMap.get(parent) + 1); //parent's depth + 1
            queue.add(child); //don't forget to enqueue the child
}
}
}
Use a dictionary to keep track of the level (distance from start) of each node when exploring the graph.
Example in Python:
from collections import deque
def bfs(graph, start):
queue = deque([start])
levels = {start: 0}
while queue:
vertex = queue.popleft()
for neighbour in graph[vertex]:
if neighbour in levels:
continue
queue.append(neighbour)
levels[neighbour] = levels[vertex] + 1
return levels
I wrote simple and easy-to-read code in Python.
class TreeNode:
def __init__(self, x):
self.val = x
self.left = None
self.right = None
class Solution:
def dfs(self, root):
assert root is not None
queue = [root]
level = 0
while queue:
print(level, [n.val for n in queue if n is not None])
mark = len(queue)
for i in range(mark):
n = queue[i]
if n.left is not None:
queue.append(n.left)
if n.right is not None:
queue.append(n.right)
queue = queue[mark:]
level += 1
Usage,
# [3,9,20,null,null,15,7]
n3 = TreeNode(3)
n9 = TreeNode(9)
n20 = TreeNode(20)
n15 = TreeNode(15)
n7 = TreeNode(7)
n3.left = n9
n3.right = n20
n20.left = n15
n20.right = n7
Solution().dfs(n3)
Result
0 [3]
1 [9, 20]
2 [15, 7]
I don't see this method posted so far, so here's a simple one:
You can "attach" the level to the node. For example, in the case of a tree, instead of the typical queue<TreeNode*>, use a queue<pair<TreeNode*,int>> and push {node, level} pairs into it. The root would be pushed as q.push({root, 0}), its children as q.push({root->left, 1}), q.push({root->right, 1}), and so on...
We don't need to modify the input, append nulls or even (asymptotically speaking) use any extra space just to track the levels.
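The same idea in Python, with a deque of (node, level) tuples standing in for the queue<pair<TreeNode*,int>> above (nodes are assumed to have left, right and val attributes):
from collections import deque

def bfs_with_levels(root):
    if root is None:
        return
    queue = deque([(root, 0)])        # push the root at level 0
    while queue:
        node, level = queue.popleft()
        print(level, node.val)
        if node.left is not None:
            queue.append((node.left, level + 1))
        if node.right is not None:
            queue.append((node.right, level + 1))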
I had a question in an interview and I couldn't find the optimal solution (and it's quite frustrating lol)
So you have an n-element list of 1s and 0s.
110000110101110..
The goal is to extract the longest subsequence containing as many 1s as 0s.
Here for example it is "110000110101" or "100001101011" or "0000110101110"
I have an idea for O(n^2), just scanning all possibilities from the beginning to the end, but apparently there is a way to do it in O(n).
Any ideas?
Thanks a lot!
Consider '10110':
Create a variable S. Create array A=[0].
Iterate from first number and add 1 to S if you notice 1 and subtract 1 from S if you notice 0 and append S to A.
For our example sequence A will be: [0, 1, 0, 1, 2, 1]. A is simply an array which stores a difference between number of 1s and 0s preceding the index. The sequence has to start and end at the place which has the same difference between 1s and 0s. So now our task is to find the longest distance between same numbers in A.
Now create 2 empty dictionaries (hash maps) First and Last.
Iterate through A and save position of first occurrence of every number in A in dictionary First.
Iterate through A (starting from the end) and save position of the last occurrence of each number in A in dictionary Last.
So for our example array First will be {0:0, 1:1, 2:4} and Last will be {0:2, 1:5, 2:4}
Now find the key(max_key) for which the difference between corresponding values in First and Last is the largest. This max difference is the length of the subsequence. Subsequence starts at First[max_key] and ends at Last[max_key].
I know it is a bit hard to understand, but it has complexity O(n): four loops, each of complexity N. You can replace the dictionaries with arrays of course, but it is more complicated than using dictionaries.
Solution in Python.
def find_subsequence(seq):
S = 0
A = [0]
for e in seq:
if e=='1':
S+=1
else:
S-=1
A.append(S)
First = {}
Last = {}
for pos, e in enumerate(A):
if e not in First:
First[e] = pos
for pos, e in enumerate(reversed(A)):
if e not in Last:
Last[e] = len(seq) - pos
max_difference = 0
max_key = None
for key in First:
difference = Last[key] - First[key]
if difference>max_difference:
max_difference = difference
max_key = key
if max_key is None:
return ''
return seq[First[max_key]:Last[max_key]]
find_subsequence('10110')  # gives '0110'
find_subsequence('1')  # gives ''
J.F. Sebastian's code is more optimised.
EXTRA
This problem is related to Maximum subarray problem. Its solution is also based on summing elements from start:
def max_subarray(arr):
max_diff = total = min_total = start = tstart = end = 0
for pos, val in enumerate(arr, 1):
total += val
if min_total > total:
min_total = total
tstart = pos
if total - min_total > max_diff:
max_diff = total - min_total
end = pos
start = tstart
return max_diff, arr[start:end]
I need to print the different variations of valid tag sequences of "<" and ">", given the number of times the tags should appear, and below is the solution in Python using recursion.
def genBrackets(c):
def genBracketsHelper(r,l,currentString):
if l > r or r == -1 or l == -1:
return
if r == l and r == 0:
print currentString
genBracketsHelper(r,l-1, currentString + '<')
genBracketsHelper(r-1,l, currentString + '>')
return
genBracketsHelper(c, c, '')
#display options with 4 tags
genBrackets(4)
I am having a hard time really understanding this and want to try to convert it into an iterative version, but I haven't had any success.
As per this thread: Can every recursion be converted into iteration? - it looks like it should be possible and the only exception appears to be the Ackermann function.
If anyone has any tips on how to see the "stack" maintained in Eclipse - that would also be appreciated.
PS. This is not a homework question - I am just trying to understand recursion-to-iteration conversion better.
Edit by Matthieu M.: an example of the output for better visualization:
>>> genBrackets(3)
<<<>>>
<<><>>
<<>><>
<><<>>
<><><>
I tried to keep basically the same structure as your code, but using an explicit stack rather than function calls to genBracketsHelper:
def genBrackets(c=1):
# genBracketsStack is a list of tuples, each of which
# represents the arguments to a call of genBracketsHelper
# Push the initial call onto the stack:
genBracketsStack = [(c, c, '')]
# This loop replaces genBracketsHelper itself
while genBracketsStack != []:
# Get the current arguments (now from the stack)
(r, l, currentString) = genBracketsStack.pop()
# Basically same logic as before
if l > r or r == -1 or l == -1:
continue # Acts like return
if r == l and r == 0:
print currentString
# Recursive calls are now pushes onto the stack
genBracketsStack.append((r-1,l, currentString + '>'))
genBracketsStack.append((r,l-1, currentString + '<'))
# This is kept explicit since you had an explicit return before
continue
genBrackets(4)
Note that the conversion I am using relies on all of the recursive calls being at the end of the function; the code would be more complicated if that wasn't the case.
You asked about doing this without a stack.
This algorithm walks the entire solution space, so it does a bit more work than the original versions, but it's basically the same concept:
each string has a tree of possible suffixes in your grammar
since there are only two tokens, it's a binary tree
the depth of the tree will always be c*2, so...
there must be 2**(c*2) paths through the tree
Since each path is a sequence of binary decisions, the paths correspond to the binary representations of the integers between 0 and 2**(c*2)-1.
So: just loop through those numbers and see if the binary representation corresponds to a balanced string. :)
def isValid(string):
"""
True if and only if the string is balanced.
"""
count = { '<': 0, '>':0 }
for char in string:
count[char] += 1
if count['>'] > count['<']:
return False # premature closure
if count['<'] != count['>']:
return False # unbalanced
else:
return True
def genBrackets(c):
"""
Generate every possible combination and test each one.
"""
for i in range(0, 2**(c*2)):
        candidate = bin(i)[2:].zfill(c*2).replace('0','<').replace('1','>')
if isValid(candidate):
print candidate
In general, a recursion creates a Tree of calls, the root being the original call, and the leaves being the calls that do not recurse.
A degenerate case is when each call performs only one other call; in this case the tree degenerates into a simple list. The transformation into an iteration is then simply achieved by using a stack, as demonstrated by @Jeremiah.
In the more general case, as here, each call performs (strictly) more than one call. You then obtain a real tree, and there are, therefore, several ways to traverse it.
If you use a queue instead of a stack, you are performing a breadth-first traversal. @Jeremiah presented a traversal for which I know no name. The typical "recursion" order is normally a depth-first traversal.
The main advantage of the typical recursion is that the length of the stack does not grow as much, so you should aim for depth-first in general... if the complexity does not overwhelm you :)
I suggest beginning by writing a depth first traversal of a tree, once this is done adapting it to your algorithm should be fairly simple.
EDIT: Since I had some time, I wrote the Python Tree Traversal, it's the canonical example:
class Node:
def __init__(self, el, children):
self.element = el
self.children = children
def __repr__(self):
return 'Node(' + str(self.element) + ', ' + str(self.children) + ')'
def depthFirstRec(node):
print node.element
for c in node.children: depthFirstRec(c)
def depthFirstIter(node):
stack = [([node,], 0), ]
while stack != []:
children, index = stack.pop()
if index >= len(children): continue
node = children[index]
print node.element
stack.append((children, index+1))
stack.append((node.children, 0))
Note that the stack management is slightly complicated by the need to remember the index of the child we were currently visiting.
And the adaptation of the algorithm following the depth-first order:
def generateBrackets(c):
# stack is a list of pairs children/index
stack = [([(c,c,''),], 0), ]
while stack != []:
children, index = stack.pop()
if index >= len(children): continue # no more child to visit at this level
stack.append((children, index+1)) # register next child at this level
l, r, current = children[index]
if r == 0 and l == 0: print current
# create the list of children of this node
# (bypass if we are already unbalanced)
if l > r: continue
newChildren = []
if l != 0: newChildren.append((l-1, r, current + '<'))
if r != 0: newChildren.append((l, r-1, current + '>'))
stack.append((newChildren, 0))
I just realized that storing the index is a bit "too" complicated, since I never visit back. The simple solution thus consists in removing the list elements I don't need any longer, treating the list as a queue (in fact, a stack could be sufficient)!
This applies with minimum transformation.
def generateBrackets2(c):
# stack is a list of queues of children
stack = [[(c,c,''),], ]
while stack != []:
children = stack.pop()
if children == []: continue # no more child to visit at this level
stack.append(children[1:]) # register next child at this level
l, r, current = children[0]
if r == 0 and l == 0: print current
# create the list of children of this node
# (bypass if we are already unbalanced)
if l > r: continue
newChildren = []
if l != 0: newChildren.append((l-1, r, current + '<'))
if r != 0: newChildren.append((l, r-1, current + '>'))
stack.append(newChildren)
Yes.
def genBrackets(c):
stack = [(c, c, '')]
while stack:
right, left, currentString = stack.pop()
if left > right or right == -1 or left == -1:
pass
elif right == left and right == 0:
print currentString
else:
stack.append((right, left-1, currentString + '<'))
stack.append((right-1, left, currentString + '>'))
The output order is different, but the results should be the same.
Selecting without any weights (equal probabilities) is beautifully described here.
I was wondering if there is a way to convert this approach to a weighted one.
I am also interested in other approaches as well.
Update: Sampling without replacement
If the sampling is with replacement, you can use this algorithm (implemented here in Python):
import random
items = [(10, "low"),
(100, "mid"),
(890, "large")]
def weighted_sample(items, n):
total = float(sum(w for w, v in items))
i = 0
w, v = items[0]
while n:
x = total * (1 - random.random() ** (1.0 / n))
total -= x
while x > w:
x -= w
i += 1
w, v = items[i]
w -= x
yield v
n -= 1
This is O(n + m) where m is the number of items.
Why does this work? It is based on the following algorithm:
def n_random_numbers_decreasing(v, n):
"""Like reversed(sorted(v * random() for i in range(n))),
but faster because we avoid sorting."""
while n:
v *= random.random() ** (1.0 / n)
yield v
n -= 1
The function weighted_sample is just this algorithm fused with a walk of the items list to pick out the items selected by those random numbers.
This in turn works because the probability that n random numbers 0..v will all happen to be less than z is P = (z/v)^n. Solve for z, and you get z = v * P^(1/n). Substituting a random number for P picks the largest number with the correct distribution; and we can just repeat the process to select all the other numbers.
If the sampling is without replacement, you can put all the items into a binary heap, where each node caches the total of the weights of all items in that subheap. Building the heap is O(m). Selecting a random item from the heap, respecting the weights, is O(log m). Removing that item and updating the cached totals is also O(log m). So you can pick n items in O(m + n log m) time.
(Note: "weight" here means that every time an element is selected, the remaining possibilities are chosen with probability proportional to their weights. It does not mean that elements appear in the output with a likelihood proportional to their weights.)
Here's an implementation of that, plentifully commented:
import random
class Node:
# Each node in the heap has a weight, value, and total weight.
# The total weight, self.tw, is self.w plus the weight of any children.
__slots__ = ['w', 'v', 'tw']
def __init__(self, w, v, tw):
self.w, self.v, self.tw = w, v, tw
def rws_heap(items):
# h is the heap. It's like a binary tree that lives in an array.
# It has a Node for each pair in `items`. h[1] is the root. Each
# other Node h[i] has a parent at h[i>>1]. Each node has up to 2
# children, h[i<<1] and h[(i<<1)+1]. To get this nice simple
# arithmetic, we have to leave h[0] vacant.
h = [None] # leave h[0] vacant
for w, v in items:
h.append(Node(w, v, w))
for i in range(len(h) - 1, 1, -1): # total up the tws
h[i>>1].tw += h[i].tw # add h[i]'s total to its parent
return h
def rws_heap_pop(h):
gas = h[1].tw * random.random() # start with a random amount of gas
i = 1 # start driving at the root
while gas >= h[i].w: # while we have enough gas to get past node i:
gas -= h[i].w # drive past node i
i <<= 1 # move to first child
if gas >= h[i].tw: # if we have enough gas:
gas -= h[i].tw # drive past first child and descendants
i += 1 # move to second child
w = h[i].w # out of gas! h[i] is the selected node.
v = h[i].v
h[i].w = 0 # make sure this node isn't chosen again
while i: # fix up total weights
h[i].tw -= w
i >>= 1
return v
def random_weighted_sample_no_replacement(items, n):
heap = rws_heap(items) # just make a heap...
for i in range(n):
yield rws_heap_pop(heap) # and pop n items off it.
If the sampling is with replacement, use the roulette-wheel selection technique (often used in genetic algorithms):
sort the weights
compute the cumulative weights
pick a random number in [0,1]*totalWeight
find the interval in which this number falls into
select the elements with the corresponding interval
repeat k times
If the sampling is without replacement, you can adapt the above technique by removing the selected element from the list after each iteration, then re-normalizing the weights so that their sum is 1 (valid probability distribution function)
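As a rough illustration of the with-replacement variant sketched in the list above (the helper name is mine; the optional sort-the-weights step is skipped because it is not needed for correctness):
import bisect
import random

def roulette_pick(items, k):
    """items is a list of (weight, value) pairs; sampling is with replacement."""
    weights, values = zip(*items)
    cumulative = []
    total = 0
    for w in weights:
        total += w
        cumulative.append(total)                # cumulative weights
    picks = []
    for _ in range(k):
        r = random.random() * total             # random number in [0, totalWeight)
        i = bisect.bisect_right(cumulative, r)  # interval the number falls into
        picks.append(values[i])
    return picks

print(roulette_pick([(10, "low"), (100, "mid"), (890, "large")], 3))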
I know this is a very old question, but I think there's a neat trick to do this in O(n) time if you apply a little math!
The exponential distribution has two very useful properties.
Given n samples from different exponential distributions with different rate parameters, the probability that a given sample is the minimum is equal to its rate parameter divided by the sum of all rate parameters.
It is "memoryless". So if you already know the minimum, then the probability that any of the remaining elements is the 2nd-to-min is the same as the probability that if the true min were removed (and never generated), that element would have been the new min. This seems obvious, but I think because of some conditional probability issues, it might not be true of other distributions.
Using fact 1, we know that choosing a single element can be done by generating these exponential distribution samples with rate parameter equal to the weight, and then choosing the one with minimum value.
Using fact 2, we know that we don't have to re-generate the exponential samples. Instead, just generate one for each element, and take the k elements with lowest samples.
Finding the lowest k can be done in O(n). Use the Quickselect algorithm to find the k-th element, then simply take another pass through all elements and output all lower than the k-th.
A useful note: if you don't have immediate access to a library to generate exponential distribution samples, it can be easily done by: -ln(rand())/weight
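A small sketch of that recipe (the helper name is mine; it uses heapq.nsmallest instead of the Quickselect pass described above, so it runs in O(n log k) rather than strictly O(n)):
import heapq
import math
import random

def weighted_sample_no_replacement(items, k):
    """items is a list of (weight, value) pairs with positive weights."""
    # One exponential sample per item, generated as -ln(U)/weight,
    # then keep the k items with the smallest samples.
    keyed = [(-math.log(1.0 - random.random()) / w, v) for w, v in items]
    return [v for _, v in heapq.nsmallest(k, keyed)]

print(weighted_sample_no_replacement([(10, "low"), (100, "mid"), (890, "large")], 2))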
I've done this in Ruby
https://github.com/fl00r/pickup
require 'pickup'
pond = {
"selmon" => 1,
"carp" => 4,
"crucian" => 3,
"herring" => 6,
"sturgeon" => 8,
"gudgeon" => 10,
"minnow" => 20
}
pickup = Pickup.new(pond, uniq: true)
pickup.pick(3)
#=> [ "gudgeon", "herring", "minnow" ]
pickup.pick
#=> "herring"
pickup.pick
#=> "gudgeon"
pickup.pick
#=> "sturgeon"
If you want to generate large arrays of random integers with replacement, you can use piecewise linear interpolation. For example, using NumPy/SciPy:
import numpy
import scipy.interpolate
def weighted_randint(weights, size=None):
"""Given an n-element vector of weights, randomly sample
integers up to n with probabilities proportional to weights"""
n = weights.size
# normalize so that the weights sum to unity
weights = weights / numpy.linalg.norm(weights, 1)
# cumulative sum of weights
cumulative_weights = weights.cumsum()
# piecewise-linear interpolating function whose domain is
# the unit interval and whose range is the integers up to n
f = scipy.interpolate.interp1d(
        numpy.hstack((0.0, cumulative_weights)),
numpy.arange(n + 1), kind='linear')
return f(numpy.random.random(size=size)).astype(int)
This is not effective if you want to sample without replacement.
Here's a Go implementation from geodns:
package foo
import (
"log"
"math/rand"
)
type server struct {
Weight int
data interface{}
}
func foo(servers []server) []server {
// servers list is already sorted by the Weight attribute
// number of items to pick
max := 4
result := make([]server, max)
sum := 0
for _, r := range servers {
sum += r.Weight
}
for si := 0; si < max; si++ {
n := rand.Intn(sum + 1)
s := 0
for i := range servers {
s += int(servers[i].Weight)
if s >= n {
log.Println("Picked record", i, servers[i])
sum -= servers[i].Weight
result[si] = servers[i]
// remove the server from the list
servers = append(servers[:i], servers[i+1:]...)
break
}
}
}
return result
}
If you want to pick x elements from a weighted set without replacement such that elements are chosen with a probability proportional to their weights:
import random
def weighted_choose_subset(weighted_set, count):
"""Return a random sample of count elements from a weighted set.
weighted_set should be a sequence of tuples of the form
(item, weight), for example: [('a', 1), ('b', 2), ('c', 3)]
Each element from weighted_set shows up at most once in the
result, and the relative likelihood of two particular elements
showing up is equal to the ratio of their weights.
This works as follows:
1.) Line up the items along the number line from [0, the sum
of all weights) such that each item occupies a segment of
length equal to its weight.
2.) Randomly pick a number "start" in the range [0, total
weight / count).
3.) Find all the points "start + n/count" (for all integers n
such that the point is within our segments) and yield the set
containing the items marked by those points.
Note that this implementation may not return each possible
subset. For example, with the input ([('a': 1), ('b': 1),
('c': 1), ('d': 1)], 2), it may only produce the sets ['a',
'c'] and ['b', 'd'], but it will do so such that the weights
are respected.
This implementation only works for nonnegative integral
weights. The highest weight in the input set must be less
than the total weight divided by the count; otherwise it would
be impossible to respect the weights while never returning
that element more than once per invocation.
"""
if count == 0:
return []
total_weight = 0
max_weight = 0
borders = []
for item, weight in weighted_set:
if weight < 0:
raise RuntimeError("All weights must be positive integers")
# Scale up weights so dividing total_weight / count doesn't truncate:
weight *= count
total_weight += weight
borders.append(total_weight)
max_weight = max(max_weight, weight)
step = int(total_weight / count)
if max_weight > step:
raise RuntimeError(
"Each weight must be less than total weight / count")
next_stop = random.randint(0, step - 1)
results = []
current = 0
for i in range(count):
while borders[current] <= next_stop:
current += 1
results.append(weighted_set[current][0])
next_stop += step
return results
In the question you linked to, Kyle's solution would work with a trivial generalization.
Scan the list and sum the total weights. Then the probability to choose an element should be:
1 - (1 - (#needed/(weight left)))/(weight at n). After visiting a node, subtract its weight from the total. Also, if you need n and have n left, you have to stop explicitly.
You can check that, with everything having weight 1, this simplifies to Kyle's solution.
Edited: (had to rethink what twice as likely meant)
This one does exactly that with O(n) and no excess memory usage. I believe this is a clever and efficient solution easy to port to any language. The first two lines are just to populate sample data in Drupal.
function getNrandomGuysWithWeight($numitems){
$q = db_query('SELECT id, weight FROM theTableWithTheData');
$q = $q->fetchAll();
$accum = 0;
foreach($q as $r){
$accum += $r->weight;
$r->weight = $accum;
}
$out = array();
while(count($out) < $numitems && count($q)){
$n = rand(0,$accum);
$lessaccum = NULL;
$prevaccum = 0;
$idxrm = 0;
foreach($q as $i=>$r){
if(($lessaccum == NULL) && ($n <= $r->weight)){
$out[] = $r->id;
$lessaccum = $r->weight- $prevaccum;
$accum -= $lessaccum;
$idxrm = $i;
}else if($lessaccum){
$r->weight -= $lessaccum;
}
$prevaccum = $r->weight;
}
unset($q[$idxrm]);
}
return $out;
}
I am putting here a simple solution for picking 1 item; you can easily expand it for k items (Java style):
double random = Math.random();
double sum = 0;
for (int i = 0; i < items.length; i++) {
val = items[i];
sum += val.getValue();
if (sum > random) {
selected = val;
break;
}
}
I have implemented an algorithm similar to Jason Orendorff's idea in Rust here. My version additionally supports bulk operations: insert and remove (when you want to remove a bunch of items given by their ids, not through the weighted selection path) from the data structure in O(m + log n) time, where m is the number of items to remove and n the number of items stored.
Sampling without replacement with recursion - an elegant and very short solution in C#
//how many ways we can choose 4 out of 60 students, so that every time we choose different 4
class Program
{
static void Main(string[] args)
{
int group = 60;
int studentsToChoose = 4;
Console.WriteLine(FindNumberOfStudents(studentsToChoose, group));
}
private static int FindNumberOfStudents(int studentsToChoose, int group)
{
if (studentsToChoose == group || studentsToChoose == 0)
return 1;
return FindNumberOfStudents(studentsToChoose, group - 1) + FindNumberOfStudents(studentsToChoose - 1, group - 1);
}
}
I just spent a few hours trying to get behind the algorithms for sampling without replacement out there, and this topic is more complex than I initially thought. That's exciting! For the benefit of future readers (have a good day!) I document my insights here, including a ready-to-use function which respects the given inclusion probabilities, further below. A nice and quick mathematical overview of the various methods can be found here: Tillé: Algorithms of sampling with equal or unequal probabilities. For example, Jason's method can be found on page 46. The caveat with his method is that the weights are not proportional to the inclusion probabilities, as also noted in the document. Actually, the i-th inclusion probability can be computed recursively as follows:
def inclusion_probability(i, weights, k):
"""
Computes the inclusion probability of the i-th element
in a randomly sampled k-tuple using Jason's algorithm
(see https://stackoverflow.com/a/2149533/7729124)
"""
if k <= 0: return 0
cum_p = 0
for j, weight in enumerate(weights):
# compute the probability of j being selected considering the weights
p = weight / sum(weights)
if i == j:
# if this is the target element, we don't have to go deeper,
# since we know that i is included
cum_p += p
else:
# if this is not the target element, than we compute the conditional
# inclusion probability of i under the constraint that j is included
cond_i = i if i < j else i-1
cond_weights = weights[:j] + weights[j+1:]
cond_p = inclusion_probability(cond_i, cond_weights, k-1)
cum_p += p * cond_p
return cum_p
And we can check the validity of the function above by comparing
In : for i in range(3): print(i, inclusion_probability(i, [1,2,3], 2))
0 0.41666666666666663
1 0.7333333333333333
2 0.85
to
In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: random_weighted_sample_no_replacement([(1,'a'),(2,'b'),(3,'c')],2))
Out: Counter({'a': 4198, 'b': 7268, 'c': 8534})
One way - also suggested in the document above - to specify the inclusion probabilities is to compute the weights from them. The whole complexity of the question at hand stems from the fact that one cannot do that directly, since one basically has to invert the recursion formula; symbolically, I claim this is impossible. Numerically it can be done using all kinds of methods, e.g. Newton's method. However, the complexity of inverting the Jacobian using plain Python becomes unbearable quickly; I really recommend looking into numpy.random.choice in this case.
Luckily there is a method using plain Python which may or may not be sufficiently performant for your purposes; it works great if there aren't that many different weights. You can find the algorithm on pages 75 and 76. It works by splitting up the sampling process into parts with the same inclusion probabilities, i.e. we can use random.sample again! I am not going to explain the principle here since the basics are nicely presented on page 69. Here is the code with hopefully a sufficient amount of comments:
def sample_no_replacement_exact(items, k, best_effort=False, random_=None, ε=1e-9):
"""
Returns a random sample of k elements from items, where items is a list of
tuples (weight, element). The inclusion probability of an element in the
final sample is given by
k * weight / sum(weights).
    Note that the function raises if an inclusion probability cannot be
satisfied, e.g the following call is obviously illegal:
sample_no_replacement_exact([(1,'a'),(2,'b')],2)
Since selecting two elements means selecting both all the time,
'b' cannot be selected twice as often as 'a'. In general it can be hard to
spot if the weights are illegal and the function does *not* always raise
an exception in that case. To remedy the situation you can pass
best_effort=True which redistributes the inclusion probability mass
if necessary. Note that the inclusion probabilities will change
if deemed necessary.
The algorithm is based on the splitting procedure on page 75/76 in:
http://www.eustat.eus/productosServicios/52.1_Unequal_prob_sampling.pdf
Additional information can be found here:
https://stackoverflow.com/questions/2140787/
:param items: list of tuples of type weight,element
:param k: length of resulting sample
:param best_effort: fix inclusion probabilities if necessary,
(optional, defaults to False)
:param random_: random module to use (optional, defaults to the
standard random module)
:param ε: fuzziness parameter when testing for zero in the context
of floating point arithmetic (optional, defaults to 1e-9)
:return: random sample set of size k
:exception: throws ValueError in case of bad parameters,
throws AssertionError in case of algorithmic impossibilities
"""
# random_ defaults to the random submodule
if not random_:
random_ = random
# special case empty return set
if k <= 0:
return set()
if k > len(items):
raise ValueError("resulting tuple length exceeds number of elements (k > n)")
# sort items by weight
items = sorted(items, key=lambda item: item[0])
# extract the weights and elements
weights, elements = list(zip(*items))
# compute the inclusion probabilities (short: π) of the elements
scaling_factor = k / sum(weights)
π = [scaling_factor * weight for weight in weights]
    # in case of best_effort: if an inclusion probability exceeds 1,
# try to rebalance the probabilities such that:
# a) no probability exceeds 1,
# b) the probabilities still sum to k, and
# c) the probability masses flow from top to bottom:
# [0.2, 0.3, 1.5] -> [0.2, 0.8, 1]
# (remember that π is sorted)
if best_effort and π[-1] > 1 + ε:
# probability mass we still we have to distribute
debt = 0.
for i in reversed(range(len(π))):
if π[i] > 1.:
# an 'offender', take away excess
debt += π[i] - 1.
π[i] = 1.
else:
# case π[i] < 1, i.e. 'save' element
# maximum we can transfer from debt to π[i] and still not
# exceed 1 is computed by the minimum of:
# a) 1 - π[i], and
# b) debt
max_transfer = min(debt, 1. - π[i])
debt -= max_transfer
π[i] += max_transfer
assert debt < ε, "best effort rebalancing failed (impossible)"
# make sure we are talking about probabilities
if any(not (0 - ε <= π_i <= 1 + ε) for π_i in π):
raise ValueError("inclusion probabilities not satisfiable: {}" \
.format(list(zip(π, elements))))
# special case equal probabilities
# (up to fuzziness parameter, remember that π is sorted)
if π[-1] < π[0] + ε:
return set(random_.sample(elements, k))
# compute the two possible lambda values, see formula 7 on page 75
# (remember that π is sorted)
λ1 = π[0] * len(π) / k
λ2 = (1 - π[-1]) * len(π) / (len(π) - k)
λ = min(λ1, λ2)
# there are two cases now, see also page 69
# CASE 1
# with probability λ we are in the equal probability case
# where all elements have the same inclusion probability
if random_.random() < λ:
return set(random_.sample(elements, k))
# CASE 2:
# with probability 1-λ we are in the case of a new sample without
# replacement problem which is strictly simpler,
# it has the following new probabilities (see page 75, π^{(2)}):
new_π = [
(π_i - λ * k / len(π))
/
(1 - λ)
for π_i in π
]
new_items = list(zip(new_π, elements))
# the first few probabilities might be 0, remove them
# NOTE: we make sure that floating point issues do not arise
# by using the fuzziness parameter
while new_items and new_items[0][0] < ε:
new_items = new_items[1:]
# the last few probabilities might be 1, remove them and mark them as selected
# NOTE: we make sure that floating point issues do not arise
# by using the fuzziness parameter
selected_elements = set()
while new_items and new_items[-1][0] > 1 - ε:
selected_elements.add(new_items[-1][1])
new_items = new_items[:-1]
# the algorithm reduces the length of the sample problem,
# it is guaranteed that:
# if λ = λ1: the first item has probability 0
# if λ = λ2: the last item has probability 1
assert len(new_items) < len(items), "problem was not simplified (impossible)"
# recursive call with the simpler sample problem
# NOTE: we have to make sure that the selected elements are included
return sample_no_replacement_exact(
new_items,
k - len(selected_elements),
best_effort=best_effort,
random_=random_,
ε=ε
) | selected_elements
Example:
In : sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c')],2)
Out: {'b', 'c'}
In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c'),(4,'d')],2))
Out: Counter({'a': 2048, 'b': 4051, 'c': 5979, 'd': 7922})
The weights sum up to 10, hence the inclusion probabilities compute to: a → 20%, b → 40%, c → 60%, d → 80%. (Sum: 200% = k.) It works!
Just one word of caution for production use of this function: it can be very hard to spot illegal inputs for the weights. An obvious illegal example is
In: sample_no_replacement_exact([(1,'a'),(2,'b')],2)
ValueError: inclusion probabilities not satisfiable: [(0.6666666666666666, 'a'), (1.3333333333333333, 'b')]
b cannot appear twice as often as a since both always have to be selected. There are more subtle examples. To avoid an exception in production, just use best_effort=True, which rebalances the inclusion probability mass such that we always have a valid distribution. Obviously this might change the inclusion probabilities.
I used an associative map (weight, object). For example:
{
(10,"low"),
(100,"mid"),
(10000,"large")
}
total=10110
Pick a random number between 0 and total and iterate over the keys until this number falls within the corresponding range.