I have 10 million documents. Each document is represented as a set of tokens (about 100 unique tokens per document), ignoring token frequency (bag-of-words). The unique tokens across all documents form the vocabulary V, whose size is roughly 50,000.
Requirement 1:
Given a set of tokens, say (t1, t2) (there may be more than two tokens in practice) where t1, t2 are in V, I want to find all tokens that co-occur with t1 and t2 in at least one document. That is, find all tokens u in V such that {t1, t2, u} is a subset of at least one document.
Requirement 2:
Given a set of tokens, find documents that contain these tokens.
Is there any efficient data structure to fulfill my requirements? "Efficient" means avoiding a scan over all documents.
You can use a map (usually implemented as either a hashmap or a binary tree) that maps each token to the set of documents that contain that token; this is essentially an inverted index.
When given two tokens t1, t2, compute the intersection of the two corresponding sets of documents. This gives you the set of documents that contain both t1 and t2. Then return the union of all the tokens contained in these documents.
In Python:
from collections import defaultdict

def build_map(documents):
    # Inverted index: token -> set of ids of the documents containing it.
    m = defaultdict(set)
    for i, document in enumerate(documents):
        for token in document:
            m[token].add(i)
    return m

def cooccurring_with_pair(m, documents, t1, t2):
    # Documents containing both tokens, then the union of their tokens.
    doc_ids = set.intersection(m[t1], m[t2])
    return set().union(*(documents[i] for i in doc_ids))
Testing the Python code on a small example:
documents = [set(s.split()) for s in ('the quick brown fox jumps over the lazy dog', 'the grey cat jumps in fright when hearing the dog', 'two brown mice are chased by a dog', 'the quick brown fox grows old and slow')]
m = build_map(documents)
print(m)
# defaultdict(<class 'set'>, {'brown': {0, 2, 3}, 'lazy': {0}, 'fox': {0, 3}, 'the': {0, 1, 3}, 'over': {0}, 'jumps': {0, 1}, 'dog': {0, 1, 2}, 'quick': {0, 3}, 'cat': {1}, 'hearing': {1}, 'fright': {1}, 'grey': {1}, 'when': {1}, 'in': {1}, 'mice': {2}, 'by': {2}, 'a': {2}, 'chased': {2}, 'two': {2}, 'are': {2}, 'and': {3}, 'slow': {3}, 'grows': {3}, 'old': {3}})
for t1, t2 in [('jumps', 'dog'), ('brown', 'fox'), ('cat', 'fox')]:
    print(t1, t2)
    print(cooccurring_with_pair(m, documents, t1, t2))
    print()
# jumps dog
# {'brown', 'lazy', 'fox', 'cat', 'the', 'hearing', 'over', 'jumps', 'dog', 'fright', 'grey', 'when', 'quick', 'in'}
#
# brown fox
# {'brown', 'lazy', 'and', 'fox', 'the', 'slow', 'grows', 'old', 'over', 'jumps', 'dog', 'quick'}
#
# cat fox
# set()
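Requirement 2 is just the intersection step by itself, and both requirements generalize to any number of query tokens. A minimal sketch building on the functions above (the names docs_containing and cooccurring_with are mine, not part of the original answer):

def docs_containing(m, tokens):
    # Requirement 2: ids of the documents that contain every query token.
    # m.get avoids growing the defaultdict when a token is unknown.
    sets = [m.get(t, set()) for t in tokens]
    return set.intersection(*sets) if sets else set()

def cooccurring_with(m, documents, tokens):
    # Requirement 1 for an arbitrary number of query tokens.
    return set().union(*(documents[i] for i in docs_containing(m, tokens)))

print(docs_containing(m, ['brown', 'quick']))  # {0, 3}

As a practical note, intersecting the smallest document set first (e.g. after sorting the sets by length) keeps the intersection cheap when one of the query tokens is rare.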
I am writing a demonstration for a digital GUI for analog filter design. Since demonstrations only allow for one Manipulate function, is there any way to dynamically update my Manipulate controls?
E.g. I have 4 different filter types (Lowpass, Highpass, Bandpass, Bandstop); the former two only require two frequency inputs, while the latter two require four. Is there a way to switch between two Manipulate sliders and four based on which mode was selected, without nesting Manipulates? Alternatively, can I have all four and grey out two when they are not needed?
Here is an example of dynamically changing Manipulate controls that should be easy to modify to achieve what you want. I did not write it, and I do not remember where I saw it.
Manipulate[
 {x, yyy},
 {{x, a}, {a, b, c, d}, None},
 {{yyy, 0.5}, 0, 1, None},
 {{type, 1}, Range@3, None},
 PaneSelector[{
   1 -> Column[{
      Control@{x, {a, b, c, d}, RadioButtonBar},
      Control@{{yyy, 0.5}, 0, 1},
      Control@{type, Range@3}
      }],
   2 -> Column[{
      Control@{x, {a, b, c, d}, SetterBar},
      Control@{yyy},
      Control@{type, Range@3}
      }],
   3 -> Column[{
      Control@{x, {a, b, c, d}, PopupMenu},
      Control@{{yyy, 0.5}, 0, 1},
      Control@{type, Range@3}
      }]
   }, Dynamic@type]
 ]
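Adapting the same trick to your filter case might look like the sketch below; it is untested, and the names filterType, f1 ... f4 and the frequency ranges are placeholders of mine. The PaneSelector shows two sliders for Lowpass/Highpass and four for Bandpass/Bandstop, and the Manipulate body here just echoes the current values (yours would plot the filter response):

Manipulate[
 {filterType, f1, f2, f3, f4},
 {{filterType, "Lowpass"}, {"Lowpass", "Highpass", "Bandpass", "Bandstop"}, None},
 {{f1, 100}, 10, 10000, None},
 {{f2, 200}, 10, 10000, None},
 {{f3, 300}, 10, 10000, None},
 {{f4, 400}, 10, 10000, None},
 Column[{
   Control@{filterType, {"Lowpass", "Highpass", "Bandpass", "Bandstop"}, PopupMenu},
   PaneSelector[{
     2 -> Column[{Control@{{f1, 100}, 10, 10000}, Control@{{f2, 200}, 10, 10000}}],
     4 -> Column[{Control@{{f1, 100}, 10, 10000}, Control@{{f2, 200}, 10, 10000},
        Control@{{f3, 300}, 10, 10000}, Control@{{f4, 400}, 10, 10000}}]
     }, Dynamic[If[MemberQ[{"Lowpass", "Highpass"}, filterType], 2, 4]]]
   }]
 ]

For the grey-out variant you asked about, control specifications accept options, so giving the two extra sliders something like Enabled -> Dynamic[MemberQ[{"Bandpass", "Bandstop"}, filterType]] instead of hiding them should also work.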
Given a pair of numbers (A, B).
You can perform an operation (A + B, B) or (A, A + B).
(A, B) is initialized to (1, 1).
For any N > 0, find the minimum number of operations you need to perform on (A, B) until A = N or B = N
I came across this question in an interview summary on Glassdoor. I thought through a couple of approaches and searched online, but couldn't find any articles/answers solving this question. I have a brute-force method shown below; however, it must traverse O(2^N) paths, so I'm wondering if there is an elegant solution I am not seeing.
def pairsum(N):
    A = 1
    B = 1
    return helper(N, A, B, 0)

def helper(N, A, B, ops):
    # Solution found
    if A == N or B == N:
        return ops
    # We've gone over, invalid path taken
    if A > N or B > N:
        return float("inf")
    return min(helper(N, A + B, B, ops + 1), helper(N, A, A + B, ops + 1))
Given a target number N, it's possible to compute the minimum number of operations in approximately O(N log(N)) basic arithmetic operations (though I suspect there are faster ways). Here's how:
For this problem, I think it's easier to work backwards than forwards. Suppose that we're trying to reach a target pair (a, b) of positive integers. We start with (a, b) and work backwards towards (1, 1), counting steps as we go.
The reason that this is easy is that there's only ever a single path from a pair (a, b) back to (1, 1): if a > b, then the pair (a, b) can't be the result of the second operation, so the only way we can possibly reach this pair is by applying the first operation to (a - b, b). Similarly, if a < b, we can only have reached the pair via the second operation applied to (a, b - a).
What about the case a = b? Well, if a = b = 1, there's nothing to do. If a = b and a > 1, then there's no way we can reach the pair at all: note that both operations take coprime pairs of integers to coprime pairs of integers, so if we start with (1, 1), we can never reach a pair of integers that has a greatest common divisor bigger than 1.
This leads to the following code to count the number of steps to get from (1, 1) to (a, b), for any pair of positive integers a and b:
def steps_to_reach(a, b):
    """
    Given a pair of positive integers, return the number of steps required
    to reach that pair from (1, 1), or None if no path exists.
    """
    steps = 0
    while True:
        if a > b:
            a -= b
        elif b > a:
            b -= a
        elif a == 1:  # must also have b == 1 here
            break
        else:
            return None  # no path, gcd(a, b) > 1
        steps += 1
    return steps
Looking at the code above, it bears a strong resemblance to the Euclidean algorithm for computing greatest common divisors, except that we're doing things very inefficiently, by using repeated subtractions instead of going directly to the remainder with a Euclidean division step. So it's possible to replace the above with the following equivalent, simpler, faster version:
def steps_to_reach_fast(a, b):
    """
    Given a pair of positive integers, return the number of steps required
    to reach that pair from (1, 1), or None if no path exists.

    Faster version of steps_to_reach.
    """
    steps = -1
    while b:
        a, (q, b) = b, divmod(a, b)
        steps += q
    return None if a > 1 else steps
I leave it to you to check that the two pieces of code are equivalent: it's not hard to prove, but if you don't feel like getting out pen and paper then a quick check at the prompt should be convincing:
>>> all(steps_to_reach(a, b) == steps_to_reach_fast(a, b) for a in range(1, 1001) for b in range(1, 1001))
True
The call steps_to_reach_fast(a, b) needs O(log(max(a, b))) arithmetic operations. (This follows from standard analysis of the Euclidean algorithm.)
Now it's straightforward to find the minimum number of operations for a given n:
def min_steps_to_reach(n):
    """
    Find the minimum number of steps to reach a pair (*, n) or (n, *).
    """
    # Count steps in all paths to (n, a). By symmetry, no need to
    # check (a, n) too.
    all_steps = (steps_to_reach_fast(n, a) for a in range(1, n+1))
    return min(steps for steps in all_steps if steps is not None)
This function runs reasonably quickly up to n = 1000000 or so. Let's print out the first few values:
>>> min_steps_to_reach(10**6) # takes ~1 second on my laptop
30
>>> [min_steps_to_reach(n) for n in range(1, 50)]
[0, 1, 2, 3, 3, 5, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 6, 7, 7, 7, 7, 7, 7, 8, 7, 7, 7, 8, 8, 7, 8, 8, 8, 9, 8, 8, 8, 9, 8, 8, 8, 8, 8, 9, 8]
A search at the Online Encyclopedia of Integer Sequences quickly yields the sequence A178047, which matches our sequence perfectly. The sequence is described as follows:
Consider the Farey tree A006842/A006843; a(n) = row at which the
denominator n first appears (assumes first row is labeled row 0).
And indeed, if you look at the tree generated by your two operations, starting at (1, 1), and you regard each pair as a fraction, you get something that's very similar to the Stern-Brocot tree (another name for the Farey tree): the contents of each row are the same, but the ordering within each row is different. As it turns out, it's the Stern-Brocot tree in disguise!
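This is easy to check numerically. Here's a small sketch (the helper names are mine) that builds both trees row by row and compares row contents as sets of fractions:

from fractions import Fraction

def op_rows(depth):
    # Rows of the tree generated by (a, b) -> (a+b, b) and (a, a+b),
    # reading each pair (a, b) as the fraction a/b.
    row = [(1, 1)]
    out = []
    for _ in range(depth + 1):
        out.append({Fraction(a, b) for a, b in row})
        row = [p for a, b in row for p in ((a + b, b), (a, a + b))]
    return out

def stern_brocot_rows(depth):
    # Rows of the Stern-Brocot tree, built by repeated mediant insertion
    # between the boundary fractions 0/1 and 1/0.
    seq = [(0, 1), (1, 0)]
    out = []
    for _ in range(depth + 1):
        mediants = [(a + c, b + d) for (a, b), (c, d) in zip(seq, seq[1:])]
        out.append({Fraction(a, b) for a, b in mediants})
        merged = [seq[0]]
        for med, nxt in zip(mediants, seq[1:]):
            merged += [med, nxt]
        seq = merged
    return out

>>> op_rows(10) == stern_brocot_rows(10)
True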
This observation gives us an easily computable lower bound on min_steps_to_reach: it's easy to show that the largest integer appearing as either a numerator or denominator in the ith row of the Stern-Brocot tree is the (i+2)nd Fibonacci number. So if n > Fib(i+2), then min_steps_to_reach(n) > i (and if n == Fib(i+2), then min_steps_to_reach(n) is exactly i). Getting an upper bound (or an exact value without an exhaustive search) seems to be a bit harder. Here are the worst cases: for each integer s >= 0, the smallest n requiring s steps (so, for example, 506 is the first number requiring 15 steps):
[1, 2, 3, 4, 7, 6, 14, 20, 28, 38, 54, 90, 150, 216, 350, 506, 876, 1230, 2034, 3160, 4470, 7764]
If there's a pattern here, I'm not spotting it (but it's essentially sequence A135510 on OEIS).
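For the record, the worst-case list above can be reproduced by brute force with the functions already defined (smallest_n_requiring is a name of mine):

def smallest_n_requiring(smax):
    # For each s in 0..smax, find the smallest n with
    # min_steps_to_reach(n) == s, by scanning n = 1, 2, 3, ...
    firsts = {}
    n = 1
    while any(s not in firsts for s in range(smax + 1)):
        firsts.setdefault(min_steps_to_reach(n), n)
        n += 1
    return [firsts[s] for s in range(smax + 1)]

>>> smallest_n_requiring(15)
[1, 2, 3, 4, 7, 6, 14, 20, 28, 38, 54, 90, 150, 216, 350, 506]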
[I wrote this before I realized @mark-dickinson had answered; his answer is much better than mine, but I'm providing mine for reference anyway]
The problem is fairly easy to solve if you work backwards. As an example, suppose N=65:
That means our current pair is either {65, x} or {y, 65} for some unknown values of x and y.
If {A,B} was the previous pair, this means either {A, A+B} or {A+B, B} is equal to either {65, x} or {y, 65}, which gives us 4 possible cases:
{A,A+B} = {65,x}, which would mean A=65. However, if A=65, we would've already hit A=N at an earlier step, and we're assuming this is the first step at which A=N or B=N, so we discard this possibility.
{A,A+B} = {y,65} which means A+B=65
{A+B,B} = {65,x} which means A+B=65
{A+B,B} = {y,65} which means B=65. However, if B=65, we already had a solution at a previous step, we also discard this possibility.
Therefore, A+B=65. There are 65 ways in which this can happen (actually, I think you can ignore the cases where A=0 or B=0, and also choose B>A by symmetry, but the solution is easy even without these assumptions).
We now examine all 65 cases. As an example, let's use A=25 and B=40.
If {C,D} was the pair that generated {25,40}, there are two possible cases:
{C+D,D} = {25,40} so D=40 and C=-15, which is impossible, since, starting at {1,1}, we will never get negative numbers.
{C,C+D} = {25,40} so C=25, and D=15.
Therefore, the "predecessor" of {25,40} is necessarily {25,15}.
By similar analysis, the predecessor of {25,15}, let's call it {E,F}, must have the property that either:
{E,E+F} = {25,15}, impossible since this would mean F=-10
{E+F,F} = {25,15} meaning E=10 and F=15.
Similarly the predecessor of {10,15} is {10,5}, whose predecessor is {5,5}.
The predecessor of {5,5} is either {0,5} or {5,0}. These two pairs are their own predecessors, but have no other predecessors.
Since we never hit {1,1} in this sequence, we know that {1,1} will never generate {25, 40}, so we continue computing for other pairs {A,B} such that A+B=65.
If we did hit {1,1}, we'd count the number of steps it took to get there, store the value, compute it for all other values of {A,B} such that A+B=65, and take the minimum.
Note that once we've chosen a value of A (and thus a value of B), walking the pair back to {1,1} is effectively the subtraction version of Euclid's Algorithm, and batching the repeated subtractions into divisions (as in Euclid's actual algorithm) counts the steps for one pair in O(log(N)) arithmetic operations. Since you are doing this for the N possible values of A, the whole algorithm is O(N*log(N)), much smaller than your O(2^N).
Of course, you may be able to find shortcuts to make the method even faster.
Interesting Notes
If you start with {1,1}, here are the pairs you can generate in k steps (we use k=0 for {1,1} itself), after removing duplicates:
k=0: {1,1}
k=1: {2, 1}, {1, 2}
k=2: {3, 1}, {2, 3}, {3, 2}, {1, 3}
k=3: {4, 1}, {3, 4}, {5, 3}, {2, 5}, {5, 2}, {3, 5}, {4, 3}, {1, 4}
k=4: {5, 1}, {4, 5}, {7, 4}, {3, 7}, {8, 3}, {5, 8}, {7, 5}, {2, 7}, {7, 2}, {5, 7}, {8, 5}, {3, 8}, {7, 3}, {4, 7}, {5, 4}, {1, 5}
k=5: {6, 1}, {5, 6}, {9, 5}, {4, 9}, {11, 4}, {7, 11}, {10, 7}, {3, 10}, {11, 3}, {8, 11}, {13, 8}, {5, 13}, {12, 5}, {7, 12}, {9, 7}, {2, 9}, {9, 2}, {7, 9}, {12, 7}, {5, 12}, {13, 5}, {8, 13}, {11, 8}, {3, 11}, {10, 3}, {7, 10}, {11, 7}, {4, 11}, {9, 4}, {5, 9}, {6, 5}, {1, 6}
Things to note:
You can generate N=7 and N=8 in 4 steps, but not N=6, which requires 5 steps.
The number of pairs generated is 2^k
The smallest number of steps (k) required to reach a given N is:
N=1: k=0
N=2: k=1
N=3: k=2
N=4: k=3
N=5: k=3
N=6: k=5
N=7: k=4
N=8: k=4
N=9: k=5
N=10: k=5
N=11: k=5
The resulting sequence, {0,1,2,3,3,5,4,4,5,5,5,...} is https://oeis.org/A178047
The highest number generated in k steps is the (k+2)nd Fibonacci number, http://oeis.org/A000045
The number of distinct integers you can reach in k steps is now the (k+1)st element of http://oeis.org/A293160
As an example for k=20:
There are 2^20 or 1048576 pairs when k=20
The highest number in any of the 1048576 pairs above is 17711, the 22nd (20+2) Fibonacci number
However, you can't reach all of the first 17711 integers with these pairs. You can only reach 11552 of them, the 21st (20+1) element of A293160
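These observations are easy to reproduce with a short brute-force sketch (rows_of_pairs is a name of mine, not from the original):

def rows_of_pairs(kmax):
    # Breadth-first expansion of (a, b) -> (a+b, b) and (a, a+b)
    # starting from (1, 1); row k holds the 2^k pairs reachable in
    # exactly k steps.
    row = [(1, 1)]
    for k in range(kmax + 1):
        yield row
        if k < kmax:
            row = [p for a, b in row for p in ((a + b, b), (a, a + b))]

for k, row in enumerate(rows_of_pairs(20)):
    reachable = {x for pair in row for x in pair}
    # Prints k, the pair count (2^k), the largest integer generated
    # (the (k+2)nd Fibonacci number), and the count of distinct
    # reachable integers (A293160).
    print(k, len(row), max(reachable), len(reachable))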
For details on how I worked this problem out, see https://github.com/barrycarter/bcapps/blob/master/STACK/bc-add-sets.m
I have this array ordered by hash[:points]
original = [{a: "Bill", points: 4}, {b: "Will", points: 3}, {c: "Gill", points: 2}, {d: "Pill", points: 1}]
I want to change the order of its elements based on the order of a subset array, also ordered by hash[:points].
subset = [{c: "Gill", points: 2}, {b: "Will", points: 1}]
The subset's elements are always contained in the original. But the subset's length and order come at random. It may have two, three, or four elements, in any order.
I want to incorporate the order of the subset into the original array. This can be done by reordering the original, or recreating the original in the correct order. Either will do. But I don't want to merge them. The keys and values in the subset are not important, just the order of the elements.
For example, the above subset should produce this.
# Bill and Pill retain their original position
# Gill and Will swap places as per ordering of the subset
[{a: "Bill", points: 4}, {c: "Gill", points: 2}, {b: "Will", points: 3}, {d: "{Pill}", points: 1}]
Another example with this subset: [{c: "Gill", points: 3}, {b: "Will", points: 2}, {a: "Bill", points: 1}]
# Pill remains at index 3 since it was not in the subset
# Gill, Will, and Bill are reordered based on the order of the subset
[{c: "Gill", points: 3}, {b: "Will", points: 2}, {a: "Bill", points: 1}, {d: "Pill", points: 1}]
I've tried a bunch of stuff for the past couple of hours, but I'm finding this harder than it looks.
My solution has two steps:
Collect the relevant elements from the original array, and sort them according to the subset order.
Replace them in the original array with the new order.
Here is the code:
mapped_elements = subset.map { |i| original.find { |j| j.keys == i.keys } }

result = original.map do |i|
  if subset.find { |j| j.keys == i.keys }
    mapped_elements.shift
  else
    i
  end
end
For subset = [{c: "Gill", points: 2}, {b: "Will", points: 1}] the result will be:
[{a: "Bill", points: 4}, {c: "Gill", points: 2}, {b: "Will", points: 3}, {d: "{Pill}", points: 1}]
For subset = [{c: "Gill", points: 3}, {b: "Will", points: 2}, {a: "Bill", points: 1}] the result will be:
[{c: "Gill", points: 3}, {b: "Will", points: 2}, {a: "Bill", points: 4}, {d: "Pill", points: 1}]
I know this sounds weird, but trust me, there is a very good reason for it.
Sorry, no can do. I think you chose the wrong structure for your hashes to begin with. I can't think of any reason why you would create hashes that have a different key for each person's name. When you have trouble manipulating the data structure you initially chose, you should think about restructuring your data.
teams = {
  Bill: {group: "a", points: 4},
  Will: {group: "b", points: 3},
  Gill: {group: "c", points: 2},
  Pill: {group: "d", points: 1},
}

teams_subset = {
  Gill: {group: "c", points: 3},
  Will: {group: "b", points: 2},
  Bill: {group: "a", points: 1},
}
subset_names = teams_subset.keys
new_teams = {}
teams.each do |team, stats|
  if teams_subset.include? team
    next_subset_name = subset_names.shift
    new_teams[next_subset_name] = teams[next_subset_name]
  else
    new_teams[team] = stats
  end
end
p new_teams
--output:--
{:Gill=>{:group=>"c", :points=>2}, :Will=>{:group=>"b", :points=>3}, :Bill=>{:group=>"a", :points=>4}, :Pill=>{:group=>"d", :points=>1}}
Or even:
teams = [
  {name: 'Bill', stats: {group: "a", points: 4}},
  {name: 'Will', stats: {group: "b", points: 3}},
  {name: 'Gill', stats: {group: "c", points: 2}},
  {name: 'Pill', stats: {group: "d", points: 1}},
]
Based on your new revelations, I would just use a quasi Schwartzian transform to convert your data to this form:
"Bill"=>{:name=>"Bill", :points=>4}
...then apply code similar to what I posted above, like this:
require 'set'
teams = [
  {name: 'Bill', points: 4},
  {name: 'Will', points: 3},
  {name: 'Gill', points: 2},
  {name: 'Pill', points: 1},
]

remapped_teams = {}

teams.each do |hash|
  name = hash[:name]
  remapped_teams[name] = hash
end

p remapped_teams

#--output:--
#{"Bill"=>{:name=>"Bill", :points=>4}, "Will"=>{:name=>"Will", :points=>3}, "Gill"=>{:name=>"Gill", :points=>2}, "Pill"=>{:name=>"Pill", :points=>1}}

teams_subset = [
  {name: 'Gill', points: 3},
  {name: 'Will', points: 2},
  {name: 'Bill', points: 1},
]

subset_names = teams_subset.map do |hash|
  hash[:name]
end

subset_names_lookup = Set.new(subset_names)

new_team_order = remapped_teams.map do |(name, hash)|
  if subset_names_lookup.include? name
    remapped_teams[subset_names.shift]
  else
    hash
  end
end
p new_team_order
--output:--
[{:name=>"Gill", :points=>2}, {:name=>"Will", :points=>3}, {:name=>"Bill", :points=>4}, {:name=>"Pill", :points=>1}]
Here's another way to do it. This is based on the understanding that each hash in original and in subset contains the same two keys: :name (say) and :points.
Code
def reorder(original, subset)
  orig_hash = original.each_with_object({}) { |h,g| g[h[:name]] = h }
  subset_names = subset.map { |h| h[:name] }
  orig_hash.map { |k,v|
    subset_names.include?(k) ? orig_hash[subset_names.rotate!.last] : v }
end
Examples
original = [{name: "Bill", points: 4}, {name: "Will", points: 3},
{name: "Gill", points: 2}, {name: "Pill", points: 1}]
#1
subset = [{name: "Gill", points: 2}, {name: "Will", points: 1}]
reorder(original, subset)
#=> [{:name=>"Bill", :points=>4}, {:name=>"Gill", :points=>2},
# {:name=>"Will", :points=>3}, {:name=>"Pill", :points=>
#2
subset = [{name: "Gill", points: 3}, {name: "Will", points: 2},
{name: "Bill", points: 1}]
reorder(original, subset)
#=> [{:name=>"Gill", :points=>2}, {:name=>"Will", :points=>3},
# {:name=>"Bill", :points=>4}, {:name=>"Pill", :points=>1}]
Explanation
The following calculations are performed for original above and
subset = [{c: "Gill", points: 2}, {b: "Will", points: 1}]
Construct this hash from original:
orig_hash = original.each_with_object({}) { |h,g| g[h[:name]] = h }
#=> {"Bill"=>{:name=>"Bill", :points=>4}, "Will"=>{:name=>"Will", :points=>3},
#   "Gill"=>{:name=>"Gill", :points=>2}, "Pill"=>{:name=>"Pill", :points=>1}}
and an array of the values of :name from subset:
subset_names = subset.map { |h| h[:name] }
#=> ["Gill", "Will"]
All that remains is to map each k=>v element of orig_hash to either v (if subset_names does not include the key k) or to the first element of subset_names. As we cannot remove that first element from subset_names (the full array is still needed for the include? test on later keys), I have chosen to rotate the array by +1 and then retrieve the value from the last position. That way, the next name in subset_names is positioned properly, at the beginning of the array.
orig_hash.map { |k,v| subset_names.include?(k) ? orig_hash[subset_names.rotate!.last] : v }
#=> [{:name=>"Bill", :points=>4}, {:name=>"Gill", :points=>2},
# {:name=>"Will", :points=>3}, {:name=>"Pill", :points=>1}]