Fast set overlap matching algorithm

Let's say I have two sets:
A = [1, 3, 5, 7, 9, 11]
and
B = [1, 3, 9, 11, 12, 13, 14]
Both sets can have arbitrary (and differing) numbers of elements.
I am writing a performance-critical application that requires me to perform a search to determine the number of elements which both sets have in common. I don't actually need to return the matches, only the number of matches.
Obviously, a naive method would be brute force, but I suspect that is nowhere near optimal. Is there an algorithm for performing this type of operation?
If it helps, in all cases the sets will consist of integers.

If both sets are sorted and roughly the same size, walking over them in sync, similar to the merge step of merge sort, is about as fast as it gets.
Look at the first elements.
If they match, you add that element to your result, and move both pointers forward.
Otherwise, you move the pointer that points to the smallest value forward.
Some pseudo-Python:
def overlap(a, b):
    # both lists must be sorted
    res = []
    ai = 0
    bi = 0
    while ai < len(a) and bi < len(b):
        if a[ai] == b[bi]:
            res.append(a[ai])
            ai += 1
            bi += 1
        elif a[ai] < b[bi]:
            ai += 1
        else:
            bi += 1
    return res
If one set is significantly larger than the other, you can use binary search to look for each item from the smaller set in the larger one.
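For a smaller set of size m this is O(m log n); here is a minimal count-only sketch using Python's bisect module (the function name count_overlap is mine):

import bisect

def count_overlap(small, large):
    # Both lists sorted; look up each element of `small` in `large` by binary search.
    count = 0
    for x in small:
        i = bisect.bisect_left(large, x)
        if i < len(large) and large[i] == x:
            count += 1
    return count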

Here is the idea (a very high-level description, though).
By the way, I'll take the liberty of assuming that the numbers in each set do not appear more than once; for instance, [1,3,5,5,7,7,9,11] will not occur.
You define two variables that will hold the indices you are examining in each array.
You start with the first number of each set and compare them. There are two possible outcomes: they are equal, or one is bigger than the other.
If they are equal, you count the event and move the pointers in both arrays to the next element.
If they differ, you move the pointer at the lower value to the next element in its array and repeat the process (compare both values).
The loop ends when either pointer moves past the last element of its array.
Hope I was able to explain it clearly; a sketch follows.
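Since the question only needs the number of matches, here is a count-only version of that walk in Python (my own sketch, assuming both lists are sorted and duplicate-free):

def count_common(a, b):
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:       # match: count it, advance both pointers
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:      # advance the pointer at the smaller value
            i += 1
        else:
            j += 1
    return count

print(count_common([1, 3, 5, 7, 9, 11], [1, 3, 9, 11, 12, 13, 14]))  # 4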

If both sets are sorted, the smallest element overall is either the minimum of the first set or the minimum of the second set. If it's the minimum of the first set, then the next smallest element is either the minimum of the second set or the second minimum of the first set. If you repeat this until the end of both sets, you have ordered both sets. For your specific problem you just need to check whether the elements are also equal.
You can iterate over the union of both sets with the following algorithm:
intersection_set_cardinality(s1, s2)
{
    iterator i = begin(s1);
    iterator j = begin(s2);
    count = 0;
    while(i != end(s1) && j != end(s2))
    {
        if(elt(i) == elt(j))
        {
            count = count + 1;
            i = i + 1;
            j = j + 1;
        }
        else if(elt(i) < elt(j))
        {
            i = i + 1;
        }
        else
        {
            j = j + 1;
        }
    }
    return count;
}

Related

Finding the binary sub-tree that puts each element of a list into its own lowest order bucket

First, I have a list of numbers 'L', containing elements 'x' such that 0 < 'x' <= 'M' for all elements 'x'.
Second, I have a binary tree constructed in the following manner:
1) Each node has three properties: 'min', 'max', and 'vals' (vals is a list of numbers).
2) The root node has 'max'='M', 'min'=0, and 'vals'='L' (it contains all the numbers)
3) Each left child node has:
max=(parent(max) + parent(min))/2
min=parent(min)
4) Each right child node has:
max=parent(max)
min=(parent(max) + parent(min))/2
5) For each node, 'vals' is a list of numbers such that each element 'x' of
'vals' is also an element of 'L' and satisfies
min < x <= max
6) If a node has only one element in 'vals', then it has no children. I.e., we are
only looking for nodes for which 'vals' is non-empty.
I'm looking for an algorithm to find the smallest sub-tree that satisfies the above properties. In other words, I'm trying to get a list of nodes such that each child-less node contains one - and only one - element in 'vals'.
I'm almost able to brute-force it with perl using insanely baroque data structures, but I keep bumping up against the limits of my mental capacity to keep track of all the temporary variables I've used, so I'm asking for help.
It's even cooler if you know an efficient algorithm to do the above.
If you'd like to know what I'm trying to do, it's this: find the smallest covering for a discrete wavelet packet transform to uniquely separate each frequency of the standard even-tempered musical notes. The trouble is that each iteration of the wavelet transform divides the frequency range it handles in half (hence the .../2 above defining the max and min), and the musical notes have frequencies which go up exponentially, so there's no obvious relationship between the two - not one I'm able to derive analytically (or experimentally, obviously, for that matter), anyway.
Since I'm really trying to find an algorithm so I can write a program, and since the problem is put in general terms, I didn't think it appropriate to put it in DSP. If there were a general "algorithms" group, then I think it would be better there, but this seems to be the right group for algorithms in the absence of such.
Please let me know if I can clarify anything, or if you have any suggestions - even in the absence of a complete answer - any help is appreciated!
After taking a break and two cups of coffee, I answered my own question. Indexing below is done starting at 1, MATLAB-style...
L = []        // list of numbers to put in bins
sorted = []   // list of "good" nodes
nodes = []    // array of nodes to construct
sortedidx = 1
nodes[1] = { min = 0, max = 22100, val = -1, lvl = 1, row = 1 }
for (j = 1; j <= 12; j++) {             // 12 levels is a reasonable guess
    level = j + 1
    row = 1
    for (i = 2^j; i < 2^(j+1); i++) {
        if (i/2 == int(i/2)) {          // even nodes are high-pass filters
            nodes[i] = { min = (nodes[i/2].min + nodes[i/2].max)/2,  // nodes[i/2] is the parent
                         max = nodes[i/2].max,
                         val = -1,
                         lvl = level,
                         row = -1 }
        } else {                        // odd nodes are low-pass
            nodes[i] = { min = nodes[(i-1)/2].min,
                         max = (nodes[(i-1)/2].min + nodes[(i-1)/2].max)/2,
                         val = -1,
                         lvl = level,
                         row = -1 }
        }
        temp = []                       // the numbers that fall into this node's band
        tempidx = 1
        Lidx = 0
        for (k = 1; k <= size(L); k++) {
            if (nodes[i].min < L[k] && L[k] <= nodes[i].max) {
                temp[tempidx++] = L[k]  // store the matching number itself
                Lidx = k
            }
        }
        if (size(temp) == 1) {          // exactly one number in this band: a "good" node
            nodes[i].row = row++
            nodes[i].val = temp[1]
            delete L[Lidx]
            sorted[sortedidx++] = nodes[i]
        }
    }
}
Now array sorted[] contains exactly what I was looking for!
Hopefully this helps somebody else someday...
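For comparison, here is a compact recursive sketch of the same idea in Python (mine; the names are made up, and it assumes the values in L are distinct):

def build(vals, lo, hi, leaves):
    # Keep only the values in this node's band (lo, hi]; split the band in half
    # until each non-empty leaf holds exactly one value.
    members = [x for x in vals if lo < x <= hi]
    if not members:
        return
    if len(members) == 1:
        leaves.append({'min': lo, 'max': hi, 'val': members[0]})
        return
    mid = (lo + hi) / 2.0
    build(members, lo, mid, leaves)   # low-pass half
    build(members, mid, hi, leaves)   # high-pass half

leaves = []
build([27.5, 55.0, 110.0, 220.0, 440.0], 0.0, 22100.0, leaves)  # a few A-note frequencies
print(leaves)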

Number of distinct sequences of fixed length which can be generated using a given set of numbers

I am trying to count the different sequences of fixed length which can be generated using the numbers from a given set (distinct elements), such that each element from the set appears in the sequence. Below is my logic:
e.g. let the set consist of S elements, and we have to generate sequences of length K (K >= S).
1) First we choose S places out of K and place each element from the set in some order. So that gives C(K,S)*S!.
2) After that, the remaining K-S places can be filled with any values from the set, so the factor
S^(K-S) should be multiplied.
So the overall result is
C(K,S) * S! * S^(K-S)
But I am getting the wrong answer. Please help.
PS: C(K,S) is the number of ways of selecting S elements out of K elements (K >= S) irrespective of order. Also, ^ is the power symbol, i.e. 2^3 = 8.
Here is my code in python:
# m is the number of elements to select from a set of n elements
# fact is a list containing factorial values, i.e. fact[0] = 1, fact[3] = 6, and so on.
def ways(m, n):
    res = fact[n] / fact[n-m+1] * ((n-m) ** m)
    return res
What you are looking for is the number of surjective functions whose domain is a set of K elements (the K positions that we are filling out in the output sequence) and the image is a set of S elements (your input set). I think this should work:
static int Count(int K, int S)
{
    int sum = 0;
    for (int i = 1; i <= S; i++)
    {
        sum += Pow(-1, S - i) * Fact(S) / (Fact(i) * Fact(S - i)) * Pow(i, K);
    }
    return sum;
}
...where Pow and Fact are what you would expect.
Check out this math.SE question.
Here's why your approach won't work. I didn't check the code, just your explanation of the logic behind it, but I'm pretty sure I understand what you're trying to do. Let's take for example K = 4, S = {7,8,9}. Let's examine the sequence 7,8,9,7. It is a unique sequence, but you can get to it by:
Randomly choosing positions 1,2,3, filling them randomly with 7,8,9 (your step 1), then randomly choosing 7 for the remaining position 4 (your step 2).
Randomly choosing positions 2,3,4, filling them randomly with 8,9,7 (your step 1), then randomly choosing 7 for the remaining position 1 (your step 2).
By your logic, you will count it both ways, even though it should be counted only once as the end result is the same. And so on...
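To see the overcounting concretely, here is a quick Python check of the inclusion-exclusion formula (my own sketch):

from math import comb

def count_sequences(K, S):
    # Surjections from K positions onto S values:
    # sum over i = 1..S of (-1)^(S-i) * C(S,i) * i^K
    return sum((-1) ** (S - i) * comb(S, i) * i ** K for i in range(1, S + 1))

print(count_sequences(4, 3))  # 36, whereas C(4,3) * 3! * 3^(4-3) = 72 counts sequences like 7,8,9,7 twice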

Code Complexity in 3 array case

You are given three sorted arrays (in ascending order); you are required to find a triplet (one element from each array) such that the distance is minimum.
Distance is defined like this:
If a[i], b[j] and c[k] are three elements then
distance = max{abs(a[i]-b[j]), abs(a[i]-c[k]), abs(b[j]-c[k])}
Please give a solution in O(n) time complexity.
Linear time algorithm:
double MinimalDistance(double[] A, double[] B, double[] C)
{
    int i = 0, j = 0, k = 0;
    double min_value = infinity;
    double current_val;
    int opt_indexes[3] = {0, 0, 0};
    while (i < A.size && j < B.size && k < C.size)
    {
        current_val = calculate_distance(A[i], B[j], C[k]);
        if (current_val < min_value)
        {
            min_value = current_val;
            opt_indexes[0] = i;
            opt_indexes[1] = j;
            opt_indexes[2] = k;
        }
        // advance the index currently pointing at the smallest value
        if (A[i] <= B[j] && A[i] <= C[k])
            i++;
        else if (B[j] <= C[k])
            j++;
        else
            k++;
    }
    return min_value;
}
In each step you check the current distance, then increment the index of the array currently pointing to the minimal value. Each array is iterated through at most once, which means the running time is O(A.size + B.size + C.size).
If you want the optimal indexes instead of the minimal value, you can return opt_indexes instead of min_value.
Suppose we had just one sorted array; then the 3 consecutive elements with the smallest spread would be the desired solution. Now that we have three arrays, just merge them all into one big sorted array ABC (this can be done in O(n) by the merge operation from merge sort), keeping a flag with each element to record which original array it came from. Now you have to find three consecutive elements in an array like this:
a1,a2,b1,b2,b3,c1,b4,c2,c3,c4,b5,b6,a3,a4,a5,....
and here consecutive means they belong to the 3 different groups in consecutive order, e.g. a2,b3,c1 or c4,b6,a3.
Finding these three elements is not hard. Within a run of elements from the same group, only the boundary elements can matter: e.g. in the run [c2,c3,c4],[b5,b6],[a3,a4,a5], we don't need to check a4,a5,c2,c3; the candidates are clearly among c4,[b5,b6],a5. We don't even need to compare c4 or a5 against b5,b6, since the distance here is determined by a5-c4. So we can scan from the left, keep the last visited value of each group, and update the best possible solution in each iteration.
Example (first I should say that I didn't write the code, because I think that is the OP's task, not mine):
Suppose we have this sequence after merging the arrays:
a1,a2,b1,b2,b3,c1,b4,c2,c3,c4,b5,b6,a3,a4,a5,....
Let's iterate step by step.
We just need to keep track of the last item seen from each array: a tracks the current a_i, b the current b_i, and c the current c_i. Suppose at first a = b = c = -1.
In the first step a becomes a1; in the next steps:
a=a2, b=-1, c=-1
a=a2, b=b1, c=-1
a=a2, b=b2, c=-1
a=a2, b=b3, c=-1
a=a2, b=b3, c=c1
At this point we save the current pointers (a2,b3,c1) as the best triplet so far.
In the next step:
a=a2, c=c1, b=b4
Now we compare the difference b4-a2 with the previous best; if it is better, we save these pointers as the solution so far, and we proceed:
a=a2, b=b4, c=c2 (again compare and, if needed, update the best solution)
a=a2, b=b4, c=c3 (again ...)
a=a2, b=b4, c=c4 (again ...)
a=a2, b=b5, c=c4, ....
OK, if it is not clear from the text: after the merge we have (I'll suppose all arrays have at least one element):

solution = infinity;
a = b = c = -1;
bestA = bestB = bestC = -1;
for (int i = 0; i < ABC.Length; i++)
{
    if (ABC[i].type == "a")  // type is a flag that determines
                             // which array owns this element
    {
        a = ABC[i].Value;
        if (b != -1 && c != -1)
        {
            if (max(|a-b|, |b-c|, |a-c|) < solution)
            {
                solution = max(|a-b|, |b-c|, |a-c|);
                bestA = a; bestB = b; bestC = c;
            }
        }
    }
    // ... and two more similar ifs for types "b" and "c"
}
Surely there is a more elegant algorithm than this, but since you had a problem with your link, I guess this simple way of looking at the problem makes it easier; afterward you can understand your own link.
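Since the pseudocode above leaves out the cases for "b" and "c", here is a complete version of the sweep in Python (my own sketch), using heapq.merge for the O(n) three-way merge:

import heapq

def minimal_distance(A, B, C):
    # Merge the three sorted arrays, tagging each value with its source array,
    # then sweep once, remembering the last value seen from each source.
    tagged = heapq.merge(((x, 'a') for x in A),
                         ((x, 'b') for x in B),
                         ((x, 'c') for x in C))
    last = {}
    best = float('inf')
    for value, tag in tagged:
        last[tag] = value
        if len(last) == 3:  # seen at least one element from each array
            best = min(best, max(last.values()) - min(last.values()))
    return best

print(minimal_distance([1, 4, 10], [2, 15, 20], [10, 12]))  # 5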

algorithm to find longest non-overlapping sequences

I am trying to find the best way to solve the following problem. By best I mean least complex.
As input, a list of tuples (start, length), such as:
[(0,5),(0,1),(1,9),(5,5),(5,7),(10,1)]
Each element represents a sequence by its start and length; for example (5,7) is equivalent to the sequence (5,6,7,8,9,10,11) - a list of 7 elements starting with 5. One can assume that the tuples are sorted by the start element.
The output should return a non-overlapping combination of tuples that represents the longest continuous sequence(s). This means that a solution is a subset of ranges with no overlaps and no gaps and is the longest possible - there could be more than one, though.
For example for the given input the solution is:
[(0,5),(5,7)] equivalent to (0,1,2,3,4,5,6,7,8,9,10,11)
Is backtracking the best approach to solve this problem?
I'm interested in any different approaches that people could suggest.
Also if anyone knows a formal reference of this problem or another one that is similar I'd like to get references.
BTW - this is not homework.
Edit
Just to avoid some mistakes, this is another example of the expected behaviour:
for an input like [(0,1),(1,7),(3,20),(8,5)] the right answer is [(3,20)], equivalent to (3,4,5,...,22), with length 20. Some of the answers received would give [(0,1),(1,7),(8,5)], equivalent to (0,1,2,...,11,12), as the right answer. But this last answer is not correct, because it is shorter than [(3,20)].
Iterate over the list of tuples using the given ordering (by start element), while using a hashmap to keep track of the length of the longest continuous sequence ending on a certain index.
pseudo-code, skipping details like items not found in a hashmap (assume 0 returned if not found):
int bestEnd = 0;
hashmap<int,int> seq; // seq[key] = length of the longest sequence ending on key-1, or 0 if not found
foreach (tuple in orderedTuples) {
    int seqLength = seq[tuple.start] + tuple.length;
    int tupleEnd = tuple.start + tuple.length;
    seq[tupleEnd] = max(seq[tupleEnd], seqLength);
    if (seqLength > seq[bestEnd]) bestEnd = tupleEnd;
}
return new tuple(bestEnd - seq[bestEnd], seq[bestEnd]);
This is an O(N) algorithm.
If you need the actual tuples making up this sequence, you'd need to keep a linked list of tuples hashed by end index as well, updating this whenever the max length is updated for this end-point.
UPDATE: My knowledge of python is rather limited, but based on the python code you pasted, I created this code that returns the actual sequence instead of just the length:
def get_longest(arr):
    bestEnd = 0
    seqLengths = dict() # seqLengths[key] = length of the longest sequence ending on key-1, or 0 if not found
    seqTuples = dict()  # seqTuples[key] = the last tuple used in this longest sequence
    for t in arr:
        seqLength = seqLengths.get(t[0], 0) + t[1]
        tupleEnd = t[0] + t[1]
        if seqLength > seqLengths.get(tupleEnd, 0):
            seqLengths[tupleEnd] = seqLength
            seqTuples[tupleEnd] = t
        if seqLength > seqLengths.get(bestEnd, 0):
            bestEnd = tupleEnd
    longestSeq = []
    while bestEnd in seqTuples:
        longestSeq.append(seqTuples[bestEnd])
        bestEnd -= seqTuples[bestEnd][1]
    longestSeq.reverse()
    return longestSeq

if __name__ == "__main__":
    a = [(0,3),(1,4),(1,1),(1,8),(5,2),(5,5),(5,6),(10,2)]
    print(get_longest(a))
Revised algorithm:
create a hashtable of start -> list of tuples that start there
put all tuples in a queue of tupleSets
set the longestTupleSet to the first tuple
while the queue is not empty
    take a tupleSet from the queue
    if any tuples start where the tupleSet ends
        foreach tuple that starts where the tupleSet ends
            enqueue a new tupleSet of tupleSet + tuple
        continue
    if tupleSet is longer than longestTupleSet
        replace longestTupleSet with tupleSet
return longestTupleSet
C# implementation:
public static IList<Pair<int, int>> FindLongestNonOverlappingRangeSet(IList<Pair<int, int>> input)
{
    var rangeStarts = input.ToLookup(x => x.First, x => x);
    var adjacentTuples = new Queue<List<Pair<int, int>>>(
        input.Select(x => new List<Pair<int, int>> { x }));

    var longest = new List<Pair<int, int>> { input[0] };
    int longestLength = input[0].Second; // a single tuple's covered length is its Second value

    while (adjacentTuples.Count > 0)
    {
        var tupleSet = adjacentTuples.Dequeue();
        var last = tupleSet.Last();
        int end = last.First + last.Second;
        var sameStart = rangeStarts[end];
        if (sameStart.Any())
        {
            foreach (var nextTuple in sameStart)
            {
                adjacentTuples.Enqueue(tupleSet.Concat(new[] { nextTuple }).ToList());
            }
            continue;
        }
        int length = end - tupleSet.First().First;
        if (length > longestLength)
        {
            longestLength = length;
            longest = tupleSet;
        }
    }
    return longest;
}
tests:
[Test]
public void Given_the_first_problem_sample()
{
    var input = new[]
    {
        new Pair<int, int>(0, 5),
        new Pair<int, int>(0, 1),
        new Pair<int, int>(1, 9),
        new Pair<int, int>(5, 5),
        new Pair<int, int>(5, 7),
        new Pair<int, int>(10, 1)
    };
    var result = FindLongestNonOverlappingRangeSet(input);
    result.Count.ShouldBeEqualTo(2);
    result.First().ShouldBeSameInstanceAs(input[0]);
    result.Last().ShouldBeSameInstanceAs(input[4]);
}

[Test]
public void Given_the_second_problem_sample()
{
    var input = new[]
    {
        new Pair<int, int>(0, 1),
        new Pair<int, int>(1, 7),
        new Pair<int, int>(3, 20),
        new Pair<int, int>(8, 5)
    };
    var result = FindLongestNonOverlappingRangeSet(input);
    result.Count.ShouldBeEqualTo(1);
    result.First().ShouldBeSameInstanceAs(input[2]);
}
This is a special case of the longest-path problem for weighted directed acyclic graphs.
The nodes of the graph are the start points, plus the points just after the last element of each sequence, where the next sequence could start.
The problem is special because the distance between two nodes must be the same regardless of the path taken.
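As a sketch of this view (mine, and essentially the same DP as the hashmap answer above): treat the coordinates as graph nodes and the tuples as weighted edges, and relax the edges in order of start point.

from collections import defaultdict

def longest_run_length(tuples):
    # best[v] = length of the longest gap-free chain of tuples ending at coordinate v
    best = defaultdict(int)
    for start, length in sorted(tuples):
        end = start + length
        best[end] = max(best[end], best[start] + length)
    return max(best.values())

print(longest_run_length([(0,5),(0,1),(1,9),(5,5),(5,7),(10,1)]))  # 12
print(longest_run_length([(0,1),(1,7),(3,20),(8,5)]))              # 20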
Just thinking about the algorithm in basic terms, would this work?
(apologies for horrible syntax but I'm trying to stay language-independent here)
First the simplest form: Find the longest contiguous pair.
Cycle through every member and compare it to every other member with a higher startpos. If the startpos of the second member equals the sum of the startpos and length of the first member, they are contiguous. If so, form a new member in a new set with the lower startpos and the combined length to represent the pair.
Then take each of these pairs, compare them to all of the single members with a higher startpos, and repeat, forming a new set of contiguous triples (if any exist).
Continue this pattern until you get no new sets.
The tricky part is that you then have to compare the length of every member of each of your sets to find the real longest chain.
I'm pretty sure this is not as efficient as other methods, but I believe it is a viable way to brute-force this solution (a sketch follows).
I'd appreciate feedback on this and any errors I may have overlooked.
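A rough Python rendering of this level-by-level pairing (my sketch; exponential in the worst case):

def chain_len(c):
    return c[-1][0] + c[-1][1] - c[0][0]

def brute_force_longest(tuples):
    chains = {(t,) for t in tuples}  # level 1: every tuple is a chain of its own
    best = max(chains, key=chain_len)
    while chains:
        longer = set()
        for chain in chains:
            end = chain[-1][0] + chain[-1][1]
            # extend the chain with every tuple that starts where it ends
            longer.update(chain + (t,) for t in tuples if t[0] == end)
        if longer:
            best = max(best, max(longer, key=chain_len), key=chain_len)
        chains = longer
    return list(best)

print(brute_force_longest([(0,5),(0,1),(1,9),(5,5),(5,7),(10,1)]))  # [(0, 5), (5, 7)]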
Edited to replace pseudocode with actual Python code.
Edited AGAIN to change the code; the original algorithm was on the right track, but I misunderstood what the second value in the pairs was! Fortunately the basic algorithm is the same, and I was able to change it.
Here's an idea that solves the problem in O(N log N) and doesn't use a hash map (so no hidden time costs). For memory we're going to use N * 2 "things".
We're going to add two more values to each tuple: (BackCount, BackLink). In the successful combination, BackLink will link from right to left, from the right-most tuple to the left-most tuple, and BackCount will be the accumulated count for the given BackLink.
Here's some python code:
def FindTuplesStartingWith(tuples, frm):
    # The O(log N) version is left as an exercise for the reader
    ret = []
    for i in range(len(tuples)):
        if tuples[i][0] == frm:
            ret.append(i)
    return ret

def FindLongestSequence(tuples):
    # Prepare the (BackCount, BackLink) array
    bb = []  # (BackCount, BackLink)
    for OneTuple in tuples:
        bb.append((-1, -1))
    # Prepare
    LongestSequenceLen = -1
    LongestSequenceTail = -1
    # Algorithm
    for i in range(len(tuples)):
        if bb[i][0] == -1:
            bb[i] = (0, bb[i][1])
        # Is this single tuple the longest sequence so far all by itself?
        if (tuples[i][1] + bb[i][0]) > LongestSequenceLen:
            LongestSequenceLen = tuples[i][1] + bb[i][0]
            LongestSequenceTail = i
        # Find the next segment
        for j in FindTuplesStartingWith(tuples, tuples[i][0] + tuples[i][1]):
            # THIS IS THE KEY: only relink j if the chain through i is longer
            if (bb[j][0] == -1) or (bb[j][0] < (bb[i][0] + tuples[i][1])):
                # can be linked
                bb[j] = (bb[i][0] + tuples[i][1], i)
                if (bb[j][0] + tuples[j][1]) > LongestSequenceLen:
                    LongestSequenceLen = bb[j][0] + tuples[j][1]
                    LongestSequenceTail = j
    # Done! I'll now build up the solution
    ret = []
    while LongestSequenceTail > -1:
        ret.insert(0, tuples[LongestSequenceTail])
        LongestSequenceTail = bb[LongestSequenceTail][1]
    return ret

# Call the algorithm
print(FindLongestSequence([(0,5), (0,1), (1,9), (5,5), (5,7), (10,1)]))
# [(0, 5), (5, 7)]
print(FindLongestSequence([(0,1), (1,7), (3,20), (8,5)]))
# [(3, 20)]
The key to the whole algorithm is where the "THIS IS THE KEY" comment is in the code. We know our current StartTuple can be linked to EndTuple. If a longer sequence that ends at EndTuple.To exists, it was found by the time we got to this point, because it had to start at a smaller StartTuple.From, and the array is sorted on "From"!
I removed the previous solution because it was not tested.
The problem is finding the longest path in a "weighted directed acyclic graph", which can be solved in linear time:
http://en.wikipedia.org/wiki/Longest_path_problem#Weighted_directed_acyclic_graphs
Take the set {start positions} union {start position + length} as the vertices. For your example that would be {0, 1, 5, 10, 11, 12}.
For vertices v0, v1: if there is a tuple with start v0 and length w such that v0 + w = v1, add a directed edge from v0 to v1 with weight w.
Now follow the pseudocode on the Wikipedia page. Since the number of vertices is at most 2n (where n is the number of tuples), the problem can still be solved in linear time.
This is a simple reduce operation. Given a pair of consecutive tuples, they either can or can't be combined. So define the pairwise combination function:
def combo(first, second):
    if first[0] + first[1] == second[0]:
        return [(first[0], first[1] + second[1])]
    else:
        return [first, second]
This just returns a list of either one element combining the two arguments, or the original two elements.
Then define a function to iterate over the first list and combine pairs:
def collapse(tupleList):
    first = tupleList.pop(0)
    newList = []
    for item in tupleList:
        collapsed = combo(first, item)
        if len(collapsed) == 2:
            newList.append(collapsed[0])
        first = collapsed.pop()
    newList.append(first)
    return newList
This keeps a first element to compare with the current item in the list (starting at the second item), and when it can't combine them it drops the first into a new list and replaces first with the second of the two.
Then just call collapse with the list of tuples:
>>> collapse( [(5, 7), (12, 3), (0, 5), (0, 7), (7, 2), (9, 3)] )
[(5, 10), (0, 5), (0, 12)]
[Edit] Finally, iterate over the result to get the longest sequence.
def longest(seqs):
    collapsed = collapse(seqs)
    return max(collapsed, key=lambda x: x[1])
[/Edit]
Complexity O(N). For bonus marks, do it in reverse so that the initial pop(0) becomes a pop() and you don't have to reindex the array, or move an iterator instead (a sketch of the reversed variant follows). For top marks, make it run as a pairwise reduce operation for multithreaded goodness.
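The reversed variant might look like this (my sketch; it consumes the list right to left so each pop() is O(1), and it reuses combo() from above):

def collapse_reversed(tupleList):
    items = list(tupleList)
    right = items.pop()                 # the later element of the current pair
    newList = []
    while items:
        item = items.pop()              # the earlier element
        collapsed = combo(item, right)  # note the argument order flips
        if len(collapsed) == 2:
            newList.append(right)       # can't combine: the `right` run is finished
        right = collapsed[0]
    newList.append(right)
    newList.reverse()                   # restore the original output order
    return newList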
This sounds like a perfect "dynamic programming" problem...
The simplest approach would be brute force (e.g. recursion), but that has exponential complexity.
With dynamic programming you can set up an array a of length n, where n is the maximum of all (start + length) values of your problem, and a[i] denotes the length of the longest non-overlapping sequence ending at position i. You can then step through all tuples, updating a. The complexity of this algorithm is O(n*k), where k is the number of input values.
Create an ordered array of all start and end points and initialise all of them to one.
For each item in your tuple list, compare its end points (start and end) to the ordered items in your array; if any point lies strictly between them (e.g. a point in the array is 5 and you have start 2 with length 4), change its value to zero.
After finishing the loop, walk across the ordered array: when you see a 1, start a strip; while you keep seeing 1s, extend the strip; on any zero, close the strip; and so on.
At the end, check the lengths of the strips.
I think the complexity is around O(4-5*N)
(SEE UPDATE)
with N being the number of items in the tuple list.
UPDATE
As you figured out, the complexity estimate is not accurate, but it is definitely very small, since it is a function of the number of line stretches (tuple items).
So if N is the number of line stretches, sorting is O(2N * log 2N), comparison is O(2N), and finding line stretches is also O(2N). So all in all it is O(2N(log 2N + 2)).

Is it possible to rearrange an array in place in O(N)?

If I have a size N array of objects, and I have an array of unique numbers in the range 1...N, is there any algorithm to rearrange the object array in-place in the order specified by the list of numbers, and yet do this in O(N) time?
Context: I am doing a quick-sort-ish algorithm on objects that are fairly large in size, so it would be faster to do the swaps on indices than on the objects themselves, and only move the objects in one final pass. I'd just like to know if I could do this last pass without allocating memory for a separate array.
Edit: I am not asking how to do a sort in O(N) time, but rather how to do the post-sort rearranging in O(N) time with O(1) space. Sorry for not making this clear.
I think this should do:
static <T> void arrange(T[] data, int[] p) {
    boolean[] done = new boolean[p.length];
    for (int i = 0; i < p.length; i++) {
        if (!done[i]) {
            T t = data[i];
            for (int j = i;;) {
                done[j] = true;
                if (p[j] != i) {
                    data[j] = data[p[j]];
                    j = p[j];
                } else {
                    data[j] = t;
                    break;
                }
            }
        }
    }
}
Note: This is Java. If you do this in a language without garbage collection, be sure to delete done.
If you care about space, you can use a BitSet for done. I assume you can afford an additional bit per element because you seem willing to work with a permutation array, which is several times that size.
This algorithm copies instances of T n + k times, where k is the number of cycles in the permutation. You can reduce this to the optimal number of copies by skipping those i where p[i] = i.
The approach is to follow the "permutation cycles" of the permutation rather than indexing the array left-to-right. But since you do have to begin somewhere, every time a new permutation cycle is needed, the search for an unpermuted element proceeds left-to-right:
// Pseudo-code
N : integer, N > 0 // N is the number of elements
swaps : integer [0..N]
data[N] : array of object
permute[N] : array of integer [-1..N] denoting the permutation (a used element is -1)
next_scan_start : integer

swaps = 0
next_scan_start = 0
while (swaps < N)
{
    // Search for the next index that is not yet permuted.
    for (idx_cycle_search = next_scan_start;
         idx_cycle_search < N;
         ++idx_cycle_search)
        if (permute[idx_cycle_search] >= 0)
            break;
    next_scan_start = idx_cycle_search + 1;

    // This is a provable invariant. In short, the number of non-negative
    // elements in permute[] equals (N - swaps).
    assert( idx_cycle_search < N );

    // Completely permute one permutation cycle, 'following the
    // permutation cycle's trail'. This is O(N).
    while (permute[idx_cycle_search] >= 0)
    {
        swap( data[idx_cycle_search], data[permute[idx_cycle_search]] )
        swaps++;
        old_idx = idx_cycle_search;
        idx_cycle_search = permute[idx_cycle_search];
        permute[old_idx] = -1;
        // '= -idx_cycle_search - 1' could also be used rather than '-1',
        // which would allow these changes to permute[] to be reversed
    }
}
Do you mean that you have an array of objects O[1..N], and an array P[1..N] that contains a permutation of the numbers 1..N, and in the end you want to get an array O1 of objects such that O1[k] = O[P[k]] for all k = 1..N?
As an example, if your objects are letters A,B,C...,Y,Z and your array P is [26,25,24,..,2,1] is your desired output Z,Y,...C,B,A ?
If yes, I believe you can do it in linear time using only O(1) additional memory. Reversing elements of an array is a special case of this scenario. In general, I think you would need to consider decomposition of your permutation P into cycles and then use it to move around the elements of your original array O[].
If that's what you are looking for, I can elaborate more.
EDIT: Others already presented excellent solutions while I was sleeping, so no need to repeat it here. ^_^
EDIT: My O(1) additional space is indeed not entirely correct. I was thinking only about "data" elements, but in fact you also need to store one bit per permutation element, so if we are precise, we need O(log n) extra bits for that. But most of the time using a sign bit (as suggested by J.F. Sebastian) is fine, so in practice we may not need anything more than we already have.
If you didn't mind allocating memory for an extra hash of indexes, you could keep a mapping of original location to current location to get a time complexity of near O(n). Here's an example in Ruby, since it's readable and pseudocode-ish. (This could be shorter or more idiomatically Ruby-ish, but I've written it out for clarity.)
#!/usr/bin/ruby
objects = ['d', 'e', 'a', 'c', 'b']
order = [2, 4, 3, 0, 1]
cur_locations = {}
order.each_with_index do |orig_location, ordinality|
  # Find the current location of the item.
  cur_location = orig_location
  while not cur_locations[cur_location].nil? do
    cur_location = cur_locations[cur_location]
  end
  # Swap the items and keep track of whatever we swapped forward.
  objects[ordinality], objects[cur_location] = objects[cur_location], objects[ordinality]
  cur_locations[ordinality] = orig_location
end
puts objects.join(' ')
That obviously does involve some extra memory for the hash, but since it's just for indexes and not your "fairly large" objects, hopefully that's acceptable. Since hash lookups are O(1), even though there is a slight bump to the complexity due to the case where an item has been swapped forward more than once and you have to rewrite cur_location multiple times, the algorithm as a whole should be reasonably close to O(n).
If you wanted you could build a full hash of original to current positions ahead of time, or keep a reverse hash of current to original, and modify the algorithm a bit to get it down to strictly O(n). It'd be a little more complicated and take a little more space, so this is the version I wrote out, but the modifications shouldn't be difficult.
EDIT: Actually, I'm fairly certain the time complexity is just O(n), since each ordinality can have at most one hop associated, and thus the maximum number of lookups is limited to n.
#!/usr/bin/env python
def rearrange(objects, permutation):
    """Rearrange `objects` inplace according to `permutation`.

    ``result = [objects[p] for p in permutation]``
    """
    seen = [False] * len(permutation)
    for i, already_seen in enumerate(seen):
        if not already_seen: # start permutation cycle
            first_obj, j = objects[i], i
            while True:
                seen[j] = True
                p = permutation[j]
                if p == i: # end permutation cycle
                    objects[j] = first_obj # [old] p -> j
                    break
                objects[j], j = objects[p], p # p -> j
The algorithm (as I've noticed after I wrote it) is the same as the one from #meriton's answer in Java.
Here's a test function for the code:
def test():
    import itertools
    N = 9
    for perm in itertools.permutations(range(N)):
        L = list(range(N))
        LL = L[:]
        rearrange(L, perm)
        assert L == [LL[i] for i in perm] == list(perm), (L, list(perm), LL)
    # test whether assertions are enabled
    try:
        assert 0
    except AssertionError:
        pass
    else:
        raise RuntimeError("assertions must be enabled for the test")

if __name__ == "__main__":
    test()
There's histogram sort, though its running time is given as a bit higher than O(N): O(N log log N).
I can do it given O(N) scratch space -- copy to a new array and copy back.
EDIT: I am aware of the existence of an algorithm that will proceed through. The idea is to perform the swaps on the array of integers 1..N while at the same time mirroring the swaps on your array of large objects. I just cannot find the algorithm right now.
The problem is one of applying a permutation in place with minimal O(1) extra storage: "in-situ permutation".
It is solvable, but an algorithm is not obvious beforehand.
It is described briefly as an exercise in Knuth, and for work I had to decipher it and figure out how it worked. Look at 5.2 #13.
For some more modern work on this problem, with pseudocode:
http://www.fernuni-hagen.de/imperia/md/content/fakultaetfuermathematikundinformatik/forschung/berichte/bericht_273.pdf
I ended up writing a different algorithm for this, which first generates the list of swaps needed to apply an ordering, and then runs through the swaps to apply it. The advantage is that if you're applying the ordering to multiple lists, you can reuse the swap list, since the swap pass is extremely simple.
#include <string>
#include <utility>
#include <vector>
using namespace std;

void make_swaps(vector<int> order, vector<pair<int,int>> &swaps)
{
    // order[0] is the index in the old list of the new list's first value.
    // Invert the mapping: inverse[0] is the index in the new list of the
    // old list's first value.
    vector<int> inverse(order.size());
    for(int i = 0; i < order.size(); ++i)
        inverse[order[i]] = i;

    swaps.resize(0);
    for(int idx1 = 0; idx1 < order.size(); ++idx1)
    {
        // Swap list[idx1] with list[order[idx1]], and record this swap.
        int idx2 = order[idx1];
        if(idx1 == idx2)
            continue;
        swaps.push_back(make_pair(idx1, idx2));

        // list[idx1] is now in the correct place, but whoever wanted the value we moved out
        // of idx2 now needs to look in its new position.
        int idx1_dep = inverse[idx1];
        order[idx1_dep] = idx2;
        inverse[idx2] = idx1_dep;
    }
}

template<typename T>
void run_swaps(T &data, const vector<pair<int,int>> &swaps)
{
    for(const auto &s: swaps)
    {
        int src = s.first;
        int dst = s.second;
        swap(data[src], data[dst]);
    }
}

void test()
{
    vector<int> order = { 2, 3, 1, 4, 0 };
    vector<pair<int,int>> swaps;
    make_swaps(order, swaps);

    vector<string> data = { "a", "b", "c", "d", "e" };
    run_swaps(data, swaps);
}
