Most common subpath in a collection of paths - algorithm

There is a lot of literature on the Web about the longest common subsequence problem, but I have a slightly different problem and was wondering if anyone knows of a fast algorithm.
Say, you have a collection of paths:
[1,2,3,4,5,6,7], [2,3,4,9,10], [3,4,6,7], ...
We see that subpath [3,4] is the most common.
Know of a neat algorithm to find this? For my case there are tens of thousands of paths!

Assuming that a "path" has to encompass at least two elements, then the most common path will obviously have two elements (although there could also be a path with more than two elements that's equally common -- more on this later). So you can just iterate all the lists and count how often each pair of consecutive numbers appears in the different lists and remember those pairs that appear most often. This requires iterating each list once, which is the minimum amount you'd have to do in any case.
If you are interested in the longest most common path, then you can start the same way, finding the most common 2-segment paths, but in addition to the counts, also record the position of each of those segments (e.g. {(3,4): [2, 1, 0], ...} in your example, the numbers in the list indicating the position of the segment in the different paths). Now, you can take all the most-common length-2 paths and see if for any of those, the next element is also the same for all the occurrences of that path. In this case you have a most-common length-3 path that is equally common as the prior length-2 path (it cannot be more common, obviously). You can repeat this for length-4, length-5, etc. until it can no longer be expanded without making the path "less common". This part requires extra work of n*k for each expansion, with n being the number of candidates left and k being how often those appear.
(This assumes that frequency beats length, i.e. if there is a length-2 path appearing three times, you prefer it over a length-3 path appearing twice. The same approach can also be used for a different starting length, e.g. requiring at least length-3 paths, without changing the basic algorithm or the complexity.)
Here's a simple example implementation in Python to demonstrate the algorithm. This only goes up to length-3, but could easily be extended to length-4 and beyond with a loop. Also, it does not check any edge-cases (array-out-of-bounds etc.)
# example data
data = [[1,2, 4,5,6,7, 9],
        [1,2,3,4,5,6, 8,9],
        [1,2, 4,5,6,7,8 ]]
# step one: count how often and where each pair appears
from collections import defaultdict
pairs = defaultdict(list)
for i, lst in enumerate(data):
    for k, pair in enumerate(zip(lst, lst[1:])):
        pairs[pair].append((i,k))
# step two: find most common pair and filter
most = max([len(lst) for lst in pairs.values()])
pairs = {k: v for k, v in pairs.items() if len(v) == most}
print(pairs)
# {(1, 2): [(0, 0), (1, 0), (2, 0)], (4, 5): [(0, 2), (1, 3), (2, 2)], (5, 6): [(0, 3), (1, 4), (2, 3)]}
# step three: expand pairs to triplets, triplets to quadruples, etc.
triples = [k + (data[v[0][0]][v[0][1]+2],)
           for k, v in pairs.items()
           if len(set(data[i][k+2] for (i,k) in v)) == 1]
print(triples)
# [(4, 5, 6)]
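For illustration, here is one way the expansion step could be generalized with a loop (a sketch building on the pairs dict above; it keeps extending each surviving segment as long as all of its occurrences stay in bounds and agree on the next element):
best = pairs
while True:
    extended = {}
    for seg, occ in best.items():
        # all occurrences must still have a next element...
        if not all(k + len(seg) < len(data[i]) for (i, k) in occ):
            continue
        # ...and must agree on what that next element is
        nxt = set(data[i][k + len(seg)] for (i, k) in occ)
        if len(nxt) == 1:
            extended[seg + (nxt.pop(),)] = occ
    if not extended:
        break
    best = extended
print(list(best))
# [(4, 5, 6)]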


Algorithm for finding max value of functions of the form f(x) = a*min(b, x)?

I have an array of tuples (a, b) with a > 0 and b > 0.
Each tuple represents a function f such that f(x, a, b) = a * min(b, x).
Is there a known algorithm that, for a given x, finds which tuple returns the maximum value?
I don't want to evaluate each function to check the maximum, because I will query this array an arbitrary number of times for different x.
Example:
array = [ (1, 10), (2, 3) ]
x < 6 -> choose (2, 3)
x = 6 (intersection point) -> either (1, 10) or (2, 3) doesn't matter
x > 6 -> choose (1, 10)
So the problem is that these tuples can be sorted either by a or by b, but there can be a lot of intersection points between them (if we visualize them as graphs). I want to avoid any O(n^2) algorithm that checks, for certain ranges of x, which function is the best; that is, I don't want to compare each function with all the others to find the intersection point x' from which on I should choose one over the other.
Assuming a's, b's and queried x's are always nonnegative, each query can be done in O(log(n)) time after an O(n*log(n)) preprocessing step:
The preprocessing step eliminates such functions that are strictly dominated by others. For example, (5, 10) is larger than (1, 1) for every x. (So, if there is (5, 10) in the array, then we can remove (1, 1) because it will never be the maximum for any x.)
Here is the general condition: a function (c, d) is at least as large as (a, b) for every x if c >= a and c*d >= a*b. (This is easy to prove.)
Now, what we want to do is to remove those functions (a, b) for which there exists a (c, d) such that c >= a and c*d > a*b. This can be done in O(n*log(n)) time:
1 - Sort tuples lexicographically. What I mean by lexicographically is first compare their first coordinates, and if they are equal, then compare the second ones. For example, a sorted array might look like this:
(1, 5)
(1, 17)
(2, 9)
(4, 3)
(4, 4)
2 - Iterate over the sorted array in reverse order and keep track of the largest value of a*b that you have encountered so far. Let's call this value M. Now, assume the element that we are processing in the loop is (a, b). If a*b < M, we remove this element, because for some (c, d) that we processed earlier, c >= a and c*d > a*b, and thus (a, b) is useless. After this step, the example array will become:
(2, 9)
(4, 4)
(4, 3) was deleted because it was dominated by (4, 4). (1, 17) and (1, 5) were deleted because they are dominated by (2, 9).
Once we get rid of all the functions that are never the maximum for any x, plot the graphs of the remaining ones: every function is the maximum from the point where it intersects the one before it to the point where it intersects the one after it. For the example above, (4, 4) and (2, 9) intersect at x = 8. So (4, 4) is the maximum until x = 8, and after that point, (2, 9) is the maximum.
We want to calculate the points where consecutive functions in the array intersect, so that for a given x, we can binary-search on these points to find which function returns the maximum value.
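To make this concrete, here is a rough sketch of the whole approach in Python (my own illustration, not part of the original answer; it assumes all a, b and queried x are nonnegative):
from bisect import bisect_right

def preprocess(tuples):
    # sort lexicographically, then drop dominated functions in one reverse pass
    kept = []
    best = -1
    for a, b in reversed(sorted(tuples)):
        if a * b > best:          # not dominated by any function seen so far
            kept.append((a, b))
            best = a * b
    # kept is ordered from largest a (best for small x) to smallest a (best for
    # large x); compute the switchover point between each pair of neighbours
    cuts = [kept[i][0] * kept[i][1] / kept[i + 1][0] for i in range(len(kept) - 1)]
    return kept, cuts

def query(kept, cuts, x):
    a, b = kept[bisect_right(cuts, x)]   # O(log n) per query
    return a * min(b, x)

kept, cuts = preprocess([(1, 10), (2, 3)])
print(query(kept, cuts, 5), query(kept, cuts, 7))   # 6 7
For the example from the question, the single computed switchover point comes out as x = 6, matching the intersection point noted there.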
The key to efficiency is to avoid useless work. If you imagine a decision tree, pruning branches is a term often used for that.
For your case, the decision-making is based on choosing between two functions (or tuples of parameters). In order to select either of the two functions, you just determine the value x at which they give you the same value. One of them performs better for smaller values, one for larger values. Also, don't forget this part, it may be that one function always performs better than the other. In that case, the one performing worse can be removed completely (see also above, avoiding useless work!).
Using this approach, you can map from this switchover point to the function on the left. Finding the optimal function for an arbitrary value just requires finding the next higher switchover point.
BTW: Make sure you have unit tests in place. These things are fiddly, especially with floating point values and rounding errors, so you want to make sure that you can just run a growing suite of tests to make sure that one small bugfix didn't break things elsewhere.
I think you should sort the array by 'b' first and then by 'a'. Now for every x just use binary search to find the position from which min(b, x) gives either b or x, depending on the value. From that point on, if x is smaller than all the upcoming values of b, take that tuple as t1 and you can compute the value using that function; for the values of b which are less than x you necessarily need to traverse. I'm not sure, but that's what I can think of.
After pre-processing the data, it's possible to calculate this maximum value in time O(log(n)), where n is the number of tuples (a, b).
First, let's look at a slightly simpler question: You have a list of pairs (c, b), and you want to find the one with the largest value of c, subject to the condition that b<=x, and you want to do this many times for different values of x. For example, the following list:
c b
------
11 16
8 12
2 6
7 9
6 13
4 5
With this list, if you ask with x=10, the available values of c are 2, 7 and 4, and the maximum is 7.
Let's sort the list by b:
c b
------
4 5
2 6
7 9
8 12
6 13
11 16
Of course, some values in this list can never give an answer. For example, we can never use the c=2, b=6 row in an answer, because if 6<=x then 5<=x, so we can use the c=4, b=5 row to get a better answer. So we might as well get rid of pairs like that in the list, i.e. all pairs for which the value of c is not the highest so far. So we whittle the list down to this:
c b
------
4 5
7 9
8 12
11 16
Given this list, with an index on b, it's easy to find the highest value of c. All you have to do is find the highest value of b in the list which is <=x, then return the corresponding value of c.
Obviously, if you change the question so that you only want the values with b>=x (instead of b<=x), you can do exactly the same thing.
Right. So how does this help with the question you asked?
For a given value of x, you can split the question into 2 questions. If you can answer both of these questions then you can answer the overall question:
Of the pairs (a, b) with b<=x, which one gives the highest value of f(x,a,b) = a*b?
Of the pairs (a, b) with b>=x, which one gives the highest value of f(x,a,b) = a*x?
For (1), simply let c=a*b for each pair and then go through the whole indexing rigmarole outlined above.
For (2), let c=a and do the indexing thing above, but flipped round to do b>=x instead of b<=x; when you get your answer for a, don't forget to multiply it by x.
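A rough sketch of this split in Python (my own illustration; running prefix/suffix maxima over the b-sorted list play the role of the "whittled-down", indexed lists described above):
from bisect import bisect_left, bisect_right

def build(tuples):
    pairs = sorted(tuples, key=lambda t: t[1])   # sort by b
    bs = [b for _, b in pairs]
    # prefix maxima of a*b: best a*b among all pairs with b <= bs[i]
    pre, m = [], float('-inf')
    for a, b in pairs:
        m = max(m, a * b)
        pre.append(m)
    # suffix maxima of a: best a among all pairs with b >= bs[i]
    suf, m = [0] * len(pairs), float('-inf')
    for i in range(len(pairs) - 1, -1, -1):
        m = max(m, pairs[i][0])
        suf[i] = m
    return bs, pre, suf

def query(bs, pre, suf, x):
    best = float('-inf')
    i = bisect_right(bs, x) - 1        # rightmost pair with b <= x
    if i >= 0:
        best = max(best, pre[i])       # question 1: f = a*b
    j = bisect_left(bs, x)             # leftmost pair with b >= x
    if j < len(bs):
        best = max(best, suf[j] * x)   # question 2: f = a*x
    return best

bs, pre, suf = build([(1, 10), (2, 3)])
print(query(bs, pre, suf, 5), query(bs, pre, suf, 7))   # 6 7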

Sorting a list that has n fixed segments already sorted in ascending order

The question goes as follows:
Lists which consist of a small fixed number, n, of segments connected
end-to-end, each segment is already in ascending order.
I thought about using mergesort with the base case being: once a part corresponds to one of the n segments, go back and merge, since we already know the segments are sorted. But if I have 3 segments it won't work, since I'm dividing by two and you can't split 3 segments equally into two parts.
The other approach is similar to merge sort: I use n stacks, one for each segment (segments can be identified where L[i] > L[i+1], since each segment is in ascending order). But then I need n comparisons to figure out which element comes first, and I don't know an efficient way of comparing n elements dynamically without using another data structure to compare the elements at the tops of the stacks.
Also, you are supposed to use the problem's feature, that the segments are already ordered, to get better results than conventional algorithms, i.e. complexity less than O(n log n).
A pseudocode would be nice if you have an idea.
Edit:
An example would be [(14,20,22),(7,8,9),(1,2,3)]; here we have 3 segments of 3 elements each. Even though the segments are sorted, the whole list isn't.
p.s. () is there to point out the segments only
I think maybe you've misunderstood mergesort. While usually you would split in half and sort each half before merging, it's really the merging part which makes the algorithm. You just need to merge on runs instead.
With your example of [(14,20,22),(7,8,9),(1,2,3)]
After first merge you have [(7, 8, 9, 14, 20, 22),(1, 2, 3)]
After second merge you have [(1, 2, 3, 7, 8, 9, 14, 20, 22)]
l = [14, 20, 22, 7, 8, 9, 1, 2, 3]
rl = []       # run list
sl = [l[0]]   # temporary sublist

# split list into list of sorted sublists
for item in l[1:]:
    if item > sl[-1]:
        sl.append(item)
    else:
        rl.append(sl)
        sl = [item]
rl.append(sl)
print(rl)

# function for merging two sorted lists
def merge(l1, l2):
    l = []  # list we add into
    while True:
        if not l1:
            # first list is empty, add second list onto new list
            return l + l2
        if not l2:
            # second list is empty, add first list onto new list
            return l + l1
        if l1[0] < l2[0]:
            # rather than deleting, you could increment an index
            # which is likely to be faster, or reverse the list
            # and pop off the end, or use a data structure which
            # allows you to pop off the front
            l.append(l1[0])
            del l1[0]
        else:
            l.append(l2[0])
            del l2[0]

# keep merging sublists until only one remains
while len(rl) > 1:
    rl.append(merge(rl.pop(), rl.pop()))
print(rl)
It's worth noting that unless this is simply an exercise, you are probably better off using whatever built-in sorting function your language of choice provides.
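If the goal is efficiency rather than the exercise itself, Python's standard library can also do the n-way merge directly with a heap, which addresses the "comparing n stack tops" concern from the question; each output element costs O(log n), so the whole merge is O(N log n). A minimal sketch using the example segments:
import heapq

segments = [[14, 20, 22], [7, 8, 9], [1, 2, 3]]
print(list(heapq.merge(*segments)))
# [1, 2, 3, 7, 8, 9, 14, 20, 22]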

Python: Combinations

I am looking to solve a problem that involves different permutations of an array. I would like a function that checks whether the array under scrutiny matches a condition; if it does not, it generates a new permutation to check, and so on, and so forth. I believe this involves a while statement, so my question lies more in how to create an algorithm that generates a unique (but not random, so as to avoid duplicates) permutation upon each iteration. There is a restriction: the array will contain at least 2 but no more than 10 elements. Additionally, if no permutation matches the condition, the return should be False. I have no code thus far, as I cannot come up with the algorithm I would like to pursue yet. Any thoughts would be helpful.
Why do you need to reinvent the wheel? Since you've tagged python, you should know there are a ton of libraries that help you do useful things like this. One such library is itertools, more specifically the itertools.permutations function:
>>> from itertools import permutations
>>> x = [1, 2, 3]
>>> for p in permutations(x):
... print(p)
...
(1, 2, 3)
(1, 3, 2)
(2, 1, 3)
(2, 3, 1)
(3, 1, 2)
(3, 2, 1)
If you must write an algorithm yourself, then you should learn about the Johnson-Trotter algorithm for generating permutations. It is quite intuitive, and generates all n! permutations in O(n!) total time.
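For completeness, here is a sketch of the kind of search the question asks for; the condition argument is a placeholder predicate, and the example condition (adjacent elements must differ by more than 1) is made up purely for illustration:
from itertools import permutations

def find_matching_permutation(arr, condition):
    for p in permutations(arr):
        if condition(p):
            return list(p)   # first permutation satisfying the condition
    return False             # no permutation matched

# example: require that adjacent elements differ by more than 1
print(find_matching_permutation(
    [1, 2, 3, 4],
    lambda p: all(abs(a - b) > 1 for a, b in zip(p, p[1:]))))
# [2, 4, 1, 3]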

Compare rotated lists, containing duplicates [duplicate]

I'm looking for an efficient way to compare lists of numbers to see if they match at any rotation (comparing 2 circular lists).
When the lists don't have duplicates, picking smallest/largest value and rotating both lists before comparisons works.
But when there may be many duplicate large values, this isn't so simple.
For example, lists [9, 2, 0, 0, 9] and [0, 0, 9, 9, 2] match, whereas [9, 0, 2, 0, 9] doesn't (since the order is different).
Here's an example of an inefficient function which works.
def min_list_rotation(ls):
    return min((ls[i:] + ls[:i] for i in range(len(ls))))
# example use
ls_a = [9, 2, 0, 0, 9]
ls_b = [0, 0, 9, 9, 2]
print(min_list_rotation(ls_a) == min_list_rotation(ls_b))
This can be improved on for efficiency:
- check that the sorted lists match before running exhaustive tests.
- only test rotations that start with the minimum value (skipping matching values after that), effectively finding the minimum value with the furthest & smallest number after it (continuing, in case there are multiple matching next-biggest values).
- compare rotations without creating new lists each time.
However, it's still not a very efficient method since it relies on checking many possibilities.
Is there a more efficient way to perform this comparison?
Related question:
Compare rotated lists in python
If you are looking for duplicates in a large number of lists, you could rotate each list to its lexicographically minimal string representation, then sort the list of lists or use a hash table to find duplicates. This canonicalisation step means that you don't need to compare every list with every other list. There are clever O(n) algorithms for finding the minimal rotation described at https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation.
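For illustration, here is a sketch of that grouping step; for brevity it reuses the simple O(n^2) min-rotation idea from the question as the canonical form, but Booth's algorithm from the linked article would make the canonicalisation step O(n):
def canonical(ls):
    # lexicographically minimal rotation, used as a hashable key
    return tuple(min(ls[i:] + ls[:i] for i in range(len(ls))))

def group_rotations(lists):
    groups = {}
    for ls in lists:
        groups.setdefault(canonical(ls), []).append(ls)
    return list(groups.values())

print(group_rotations([[9, 2, 0, 0, 9], [0, 0, 9, 9, 2], [9, 0, 2, 0, 9]]))
# [[[9, 2, 0, 0, 9], [0, 0, 9, 9, 2]], [[9, 0, 2, 0, 9]]]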
You almost have it.
You can do some kind of "normalization" or "canonicalisation" of each list independently of the others; then you only need to compare item by item (or, if you want, put them in a map or in a set to eliminate duplicates, ...).
1 - Take the minimum item which is not preceded by itself (in a circular way).
In your example 92009, you should take the first 0 (not the second one).
2 - If you always have the same item (say 00000), you just keep that: 00000.
3 - If the minimum item appears several times, for each occurrence take the next item, keep the ones that are minimal, and keep going until one unique path of minimums remains.
Example: 90148301562 => you have 0148... and 0156... => you take 0148.
4 - If you cannot separate the different paths (i.e. they stay equal indefinitely), you have a repeating pattern; then it doesn't matter: take any of them.
Example: 014376501437650143765: you have the same pattern 0143765...
It is like AAA, where A = 0143765.
5 - When you have your lists in this form, it is easy to compare two of them.
How to do that efficiently:
Iterate over your list to find the minimum items Mx (those not preceded by themselves). If you find several, keep all of them.
Then iterate from each minimum Mx, take the next item, and keep only the candidates whose next item is minimal. If you go around an entire cycle, you have a repeating pattern.
Except in the case of a repeating pattern, this yields the unique minimal rotation.
Hope it helps.
I would do this in expected O(N) time using a polynomial hash function to compute the hash of list A, and every cyclic shift of list B. Where a shift of list B has the same hash as list A, I'd compare the actual elements to see if they are equal.
The reason this is fast is that with polynomial hash functions (which are extremely common!), you can calculate the hash of each cyclic shift from the previous one in constant time, so you can calculate hashes for all of the cyclic shifts in O(N) time.
It works like this:
Let's say B has N elements, then the hash of B using prime P is:
Hb=0;
for (i=0; i<N ; i++)
{
    Hb = Hb*P + B[i];
}
This is an optimized way to evaluate a polynomial in P, and is equivalent to:
Hb=0;
for (i=0; i<N ; i++)
{
    Hb += B[i] * P^(N-1-i); //^ is exponentiation, not XOR
}
Notice how every B[i] is multiplied by P^(N-1-i). If we shift B to the left by 1, then every B[i] will be multiplied by an extra P, except the first one. Since multiplication distributes over addition, we can multiply all the components at once just by multiplying the whole hash, and then fix up the factor for the first element.
The hash of the left shift of B is just
Hb1 = Hb*P + B[0]*(1-(P^N))
The second left shift:
Hb2 = Hb1*P + B[1]*(1-(P^N))
and so on...
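Here is a sketch of the same idea in Python (my own illustration; the base P and the modulus are arbitrary choices, and a hash match is re-checked element by element to guard against collisions):
def cyclic_equal(A, B, P=1_000_003, MOD=(1 << 61) - 1):
    if len(A) != len(B):
        return False
    n = len(A)
    def poly_hash(xs):
        h = 0
        for x in xs:
            h = (h * P + x) % MOD
        return h
    ha, hb = poly_hash(A), poly_hash(B)
    pn = pow(P, n, MOD)
    for shift in range(n):
        # hb is the hash of B rotated left by `shift`
        if hb == ha and B[shift:] + B[:shift] == A:
            return True
        hb = (hb * P + B[shift] * (1 - pn)) % MOD   # next left rotation, O(1)
    return False

print(cyclic_equal([9, 2, 0, 0, 9], [0, 0, 9, 9, 2]))   # True
print(cyclic_equal([9, 2, 0, 0, 9], [9, 0, 2, 0, 9]))   # False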

Find all sets that don't contain any of the given subsets

I have a problem and I'm researching the fastest algorithm to find a set that is a subset of the original set (S) and doesn't contain any of the given subsets (S1, ..., Sn) of S. The set I want to find can contain some elements of an Si, but not the whole subset.
For example, original set: S = (1, 2, 3, 4, 5), S1 = (1, 2), S2 = (1, 3)
=> longest set: (2, 3, 4, 5); other sets: (1, 4, 5), (2, 4, 5), (3, 4, 5), (1, 4),...
Anybody can give me a suggestion? Thanks!
Bad news
Consider the problem of choosing which elements to NOT include.
If we choose to NOT include element 1, we satisfy the constraints for S1 and S2.
If we choose to NOT include element 2, we satisfy the constraints for S1.
If we choose to NOT include element 3, we satisfy the constraints for S2.
So 1 gives {S1,S2}, 2 gives {S1}, 3 gives {S2}.
Your problem can be expressed as finding the minimum number of elements to NOT include such that the union of the satisfied sets (e.g. {S1,S2}) covers all of the given sets.
This is exactly the set cover problem which is NP complete.
Good news
In practice, you will probably do quite well by simply choosing the elements to NOT include based on whichever ends up covering the most sets.
This is an easy to implement greedy algorithm (although it will not always give the optimal answer).
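For illustration, a sketch of that greedy idea in Python (it returns one large valid subset, not necessarily the optimal one):
def large_valid_subset(S, subsets):
    S = set(S)
    remaining = [set(sub) for sub in subsets]
    removed = set()
    while remaining:
        # remove the element that breaks the most still-intact subsets
        best = max(S - removed, key=lambda e: sum(e in sub for sub in remaining))
        removed.add(best)
        remaining = [sub for sub in remaining if best not in sub]
    return S - removed

print(large_valid_subset({1, 2, 3, 4, 5}, [{1, 2}, {1, 3}]))   # {2, 3, 4, 5}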
