Sorting algorithm based on subset inversion - algorithm

I'm looking for a sorting algorithm based on subset inversion. It's like pancake sort, only instead of taking all the pancakes on top of the spatula, you can just invert any subset you want. Length of the subset doesn't matter.
Like this:
http://www.yourgenome.org/sites/default/files/illustrations/diagram/dna_mutations_inversion_yourgenome.png
So we can't simply swap numbers without inverting everything in between.
We're doing this to determine how one subspecies of fruitfly can mutate into the other. Both have the same genes but in a different order. The second subspecies' genome is 'sorted', i.e. the gene numbers are 1-25. The first subspecies genome is unsorted. Hence, we're looking for a sorting algorithm.
This is the "genome" we're looking at (though we should be able to have this work on all lists of numbers):
[23, 1, 2, 11, 24, 22, 19, 6, 10, 7, 25, 20, 5, 8, 18, 12, 13, 14, 15, 16, 17, 21, 3, 4, 9];
We're looking at two separate problems:
1) To sort a list of 25 numbers with the least amount of inversions
2) To sort a list of 25 numbers with the least amount of numbers moved
We also want to establish both upper and lower bounds for both.
We've already found a way to sort like this by just going from left to right, searching for the next lowest value and inverting everything in between, but we're absolutely certain we should be able to do this faster. However, we still haven't found any other methods so I'm asking for your help!
UPDATE: the method we currently use is based on the above method
but instead works both ways. It looks at the next elements needed
for both ends (e.g. 1 and 25 at the beginning) and then calculates
which inversion would be cheapest. All values at the ends can be
ignored for the rest of the algorithm because they get put into the
correct place immediately. Our first method took 18/19 steps and 148
genes, and this one does it in 17 steps and 101 genes. For both
optimalisation tactics (the two mentioned above), this is a better
method. It is however not cheaper in terms of code and processing.
Right now, we're working in Python because we have most experience with that, but I'd be happy with any pseudocode ideas on how we can more efficiently tackle this. If you think another language might be better suited, please let me know. Pseudocode, ideas, thoughts and actual code are all welcome!
Thanks in advance!

Regarding the first question: Do you know (and care about) which of the two strands the genes are on?
If so, you're in luck: This is called the inversion distance between signed permutations problem, and there is a linear-time algorithm for it: http://www.ncbi.nlm.nih.gov/pubmed/11694179. I haven't looked at the details.
If not, then unfortunately (as described on p. 2 of that paper) the problem is NP-hard, so it's very unlikely that any algorithm exists that is efficient (polynomial-time) in the worst case.
Regarding the second question: Assuming you mean that you want to find the minimum number of swaps needed to sort a list of numbers, you should be able to find solutions to this by searching here on SO and elsewhere. I think this is a clear and concise explanation. You can also use the optimal solution to this problem to get an upper bound for your first question: Any swap of positions i and j can be simulated using the two interval reversals (i, j) and (i+1, j-1). (This upper bound might be very bad, though, and in particular could be worse than your existing greedy algorithm.)

I think what you're looking for for the second question is the minimum number of swaps of adjacent elements to sort a sequence, which is equal to the number of inversions in the sequence (where a[i] > a[j] and i < j).
The first question seems quite a bit more complicated to me. One potential heuristic might be to think of the subset inversion as similar to the adjacent swap of more than one element. For example, if you've managed to get a sequence to this position,
5,6,1,2,3,4,7,8
we can "adjacent swap" indexes [0,1] with [2,3] (so inverting [0,1,2,3]),
2,1,6,5,3,4,7,8
and then [2,3] with [4,5] (inverting [2,3,4,5]),
2,1,4,3,5,6,7,8
and arrive at a sequence that now has significantly less element inversions, meaning less single adjacent swaps are needed to now complete the sort.
So maybe attempting to quantify inversions (in the sense of a[i] > a[j] and i < j) of sections rather than single elements could help move in the direction of estimating or building a method for the first question.

Related

Difficulty in thinking a divide and conquer approach

I am self-learning algorithms. As we know Divide and Conquer is one of the algorithm design paradigms. I have studied mergeSort, QuickSort, Karatsuba Multiplication, counting inversions of an array as examples of this particular design pattern. Although it sounds very simple, divides the problems into subproblems, solves each subproblem recursively, and merges the result of each of them, I found it very difficult to develop an idea of how to apply that logic to a new problem. To my understanding, all those above-mentioned canonical examples come up with a very clever trick to solve the problem. For example, I am trying to solve the following problem:
Given a sequence of n numbers such that the difference between two consecutive numbers is constant, find the missing term in logarithmic time.
Example: [5, 7, 9, 11, 15]
Answer: 13
First, I came up with the idea that it can be solved using the divide and conquer approach as the naive approach will take O(n) time. From my understanding of divide and conquer, this is how I approached:
The original problem can be divided into two independent subproblems. I can search for the missing term in the two subproblems recursively. So, I first divide the problem.
leftArray = [5,7,9]
rightArray = [11, 15]
Now it says, I need to solve the subproblems recursively until it becomes trivial to solve. In this case, the subproblem becomes of size 1. If there is only one element, there are 0 missing elements. Now to combine the result. But I am not sure how to do it or how it will solve my original problem.
Definitely, I am missing something crucial here. My question is how to approach when solving this type of divide and conquer problem. Should I come up with a trick like a mergeSort or QuickSort? The more I see the solution to this kind of problem, it feels I am memorizing the approach to solve, not understanding and each problem solves it differently. Any help or suggestion regarding the mindset when solving divide and conquer would be greatly appreciated. I have been trying for a long time to develop my algorithmic skill but I improved very little. Thanks in advance.
You have the right approach. The only missing part is an O(1) way to decide which side you are discarding.
First, note that the numbers in your problem must be ordered, otherwise you can't do better than O(n). There also needs to be at least three numbers, otherwise you wouldn't figure out the "step".
With this understanding in place, you can determine the "step" in O(1) time by examining the initial three terms, and see what's the difference between the consecutive ones. Two outcomes are possible:
Both differences are the same, and
One difference is twice as big as the other.
Case 2 hands you a solution by luck, so we will consider only the first case from now on. With the step in hand, you can determine if the range has a gap in it by subtracting the endpoints, and comparing the result to the number of gaps times the step. If you arrive at the same result, the range does not have a missing term, and can be discarded. When both halves can be discarded, the gap is between them.
As #Sergey Kalinichenko points out, this assumes the incoming set is ordered
However, if you're certain the input is ordered (which is likely in this case) observe the nth position's value to be start + jumpsize * index; this allows you to bisect to find where it shifts
Example: [5, 7, 9, 11, 15]
Answer: 13
start = 5
jumpsize = 2
check midpoint: 5 * 2 * 2 -> 9
this is valid, so the shift must be after the midpoint
recurse
You can find the jumpsize by checking the first 3 values
a, b, c = (language-dependent retrieval)
gap1 = b - a
gap2 = c - b
if gap1 != gap2:
if (value at 4th index) - c == gap1:
missing value is b + gap1 # 2nd gap doesn't match
else:
missing value is a + gap2 # 1st gap doesn't match
bisect remaining values

Dynamic algorithm to multiply elements in a sequence two at a time and find the total

I am trying to find a dynamic approach to multiply each element in a linear sequence to the following element, and do the same with the pair of elements, etc. and find the sum of all of the products. Note that any two elements cannot be multiplied. It must be the first with the second, the third with the fourth, and so on. All I know about the linear sequence is that there are an even amount of elements.
I assume I have to store the numbers being multiplied, and their product each time, then check some other "multipliable" pair of elements to see if the product has already been calculated (perhaps they possess opposite signs compared to the current pair).
However, by my understanding of a linear sequence, the values must be increasing or decreasing by the same amount each time. But since there are an even amount of numbers, I don't believe it is possible to have two "multipliable" pairs be the same (with potentially opposite signs), due to the issue shown in the following example:
Sequence: { -2, -1, 0, 1, 2, 3 }
Pairs: -2*-1, 0*1, 2*3
Clearly, since there are an even amount of pairs, the only case in which the same multiplication may occur more than once is if the elements are increasing/decreasing by 0 each time.
I fail to see how this is a dynamic programming question, and if anyone could clarify, it would be greatly appreciated!
A quick google for define linear sequence gave
A number pattern which increases (or decreases) by the same amount each time is called a linear sequence. The amount it increases or decreases by is known as the common difference.
In your case the common difference is 1. And you are not considering any other case.
The same multiplication may occur in the following sequence
Sequence = {-3, -1, 1, 3}
Pairs = -3 * -1 , 1 * 3
with a common difference of 2.
However this is not necessarily to be solved by dynamic programming. You can just iterate over the numbers and store the multiplication of two numbers in a set(as a set contains unique numbers) and then find the sum.
Probably not what you are looking for, but I've found a closed solution for the problem.
Suppose we observe the first two numbers. Note the first number by a, the difference between the numbers d. We then count for a total of 2n numbers in the whole sequence. Then the sum you defined is:
sum = na^2 + n(2n-1)ad + (4n^2 - 3n - 1)nd^2/3
That aside, I also failed to see how this is a dynamic problem, or at least this seems to be a problem where dynamic programming approach really doesn't do much. It is not likely that the sequence will go from negative to positive at all, and even then the chance that you will see repeated entries decreases the bigger your difference between two numbers is. Furthermore, multiplication is so fast the overhead from fetching them from a data structure might be more expensive. (mul instruction is probably faster than lw).

Algorithm to find first sequence of integers that sum to certain value

I have a list of numbers and I have a sum value. For instance,
list = [1, 2, 3, 5, 7, 11, 10, 23, 24, 54, 79 ]
sum = 20
I would like to generate a sequence of numbers taken from that list, such that the sequence sums up to that target. In order to help achieve this, the sequence can be of any length and repetition is allowed.
result = [2, 3, 5, 10] ,or result = [1, 1, 2, 3, 3, 5, 5] ,or result = [10, 10]
I've been doing a lot of research into this problem and have found the subset sum problem to be of interest. My problem is, in a few ways, similar to the subset sum problem in that I would like to find a subset of numbers that produces the targeted sum.
However, unlike the subset sum problem which finds all sets of numbers that sum up to the target (and so runs in exponential time if brute forcing), I only want to find one set of numbers. I want to find the first set that gives me the sum. So, in a certain sense, speed is a factor.
Additionally, I would like there to be some degree of randomness (or pseudo-randomness) to the algorithm. That is, should I run the algorithm using the same list and sum multiple times, I should get a different set of numbers each time.
What would be the best algorithm to achieve this?
Additional Notes:
What I've achieved so far is using a naive method where I cycle through the list adding it to every combination of values. This obviously takes a long time and I'm currently not feeling too happy about it. I'm hoping there is a better way to do this!
If there is no sequence that gives me the exact sum, I'm satisfied with a sequence that gives me a sum that is as close as possible to the targeted sum.
As others said, this is a NP-problem.
However, this doesn't mean small improvements aren't possible:
Is 1 in the list? [1,1,1,1...] is the solution. O(1) in a sorted list
Remove list element bigger than the target sum. O(n)
Is there any list element x with (x%sum)==0 ? Again, easy solution. O(n)
Are there any list elements x,y with (x%y)==0 ? Remove x. O(n^2)
(maybe even: Are there any list elements x,y,z with (x%y)==z or (x+y)==z ? Remove x. O(n^3))
Before using the full recursion, try if you can get the sum
just with the smallest even and smallest odd number.
...
Subset Sum problem isn't about finding all subsets, but rather about determining if there is some subset. It is a decision problem. All problems in NP are like this. And even this simpler problem is NP-complete.
This means that if you want an exact answer (the subset must sum exactly some value) you won't be able to do much better than the any subset sum algorithm (it is exponential unless P=NP).
I would attempt to reduce the problem to a brute-force search of a smaller set.
Sort the list smallest to largest.
Keep a sum and result list.
Repeat {
Draw randomly from the subset of list less than target - sum.
Increment sum by drawn value, add drawn value to result list.
} until list[0] > sum or sum == 0
If sum != 0, brute force search for small combinations from list that match the difference between sum and small combinations of result.
This approach may fail to find valid solutions, even if they exist. It can, however, quickly find a solution or quickly fail before having to resort to a slower brute force approach using the entire set at a greater depth.
This is a greedy approach to the problem:
Without 'randomness':
Obtain the single largest number in the set that is smaller than your desired sum- we'll name it X. Given it's ordered, at best it's O(1), and O(N) at worst if the sum is 2.
As you can repeat the value- say c times, do so as many times until you get closest to the sum, but be careful! Create a range of values- essentially now you'll be finding another sum! You'll now be find numbers that add up to R = (sum - X * c). So find the largest number smaller than R. Check if R - (number you just found) = 0 or if any [R - (number you just found)] % (smaller #s) == 0.
If it becomes R > 0, make partial sums of the smaller numbers less than R (this will not be more than 5 ~ 10 computations because of the nature of this algorithm). See if these would then satisfy it.
If that step makes R < 0, remove one X and start the process again.
With 'randomness':
Just get X randomly! :-)
Note: This would work best if you have a few single digit numbers.

Positioning an ordered sequence of intervals for maximum alignment with another sequence of fixed intervals

I have two sequences of intervals.
The first is fixed and non-overlapping, so something like:
[1..10], [12..15], [23..56], [72..89], ...
The second is not fixed, so it's just an ordered list of interval lengths:
[7, 2, 5, 26, ...]
The task at hand is to:
Place every interval from the second list at a given starting point, so that the second list becomes a list of fixed, non-overlapping intervals much like the first, while preserving its order
Find the alignment that minimizes the amount of integers that are in some interval from one of the lists but not in any interval from the other list
Very simple example:
[25..26], [58..68], [74..76], [78..86]
[10, 12]
The optimal solution is to place the interval of length 10 at [58..68] and the interval of length 12 at [74..86] which results in only the numbers 25, 26, and 77 being in one list but not the other.
The only thing I've come up with that seems mildly helpful is that if I lay down the intervals in order, I know how many 'penalties' the interval I've already created, so I have an upper bound for the score, which means I have an admissible heuristic and I can do A* search instead of looking at the entire tree. However, the total range of numbers spans from 0 to about 34M, so I'd like something better.
Any help would be hot!
OK, here's a half-thought-out answer. It should work in polynomial time, but I haven't bothered checking what the index is. It may well be possible to get a better index than the answer as outlined here. The details are left as an exercise to the reader :-) I hope it's not too unclear.
I'll define the score of a solution as the number of integers which appear in both lists of intervals. Let f(i,m) be the highest score it's possible to get using just the first i interval lengths, subject to the condition that none of your intervals goes above m. The function f, for fixed i, is essentially a (non-strictly) increasing function from the integers to a bounded subset of the integers. Therefore:
all values of f(i,m), for m > 0, are equal, with finitely many exceptions;
all values of f(i,m), for m < 0, are equal, with finitely many exceptions.
This means it's possible to represent all values of f(i,m) using a finite data structure (still considering a fixed value of i).
Now let F(i) be the value of this data structure representing all values of f(i,m). I claim that, given F(i), it is possible to calculate F(i+1). To do this, we only need to answer the following question for all x: If I place the new interval at x, how good is the best solution I can get? But we know what this is - it's just f(i,x) + the score we've got from this interval.
So if n is the number of intervals in the second list, the score of the best solution will be F(n).
To actually find the solution, you could work backwards from this.
You know what's the best score you can get. Say it's s_0. Then put the last interval as far left as possible, subject to the condition that it allows you to score s_0. That is, find the smallest m such that f(n,m) = s_0; and place the interval such that it only just stays inside the bound at m.
Then, let s_1 be the score you need to get from all the other intervals in order to get a total of s_0. Place the next-last interval as far left as possible, subject to the condition that you can still score s_1. That is, find the smallest m such that f(n,m) = s_1; and place the interval such that it only just stays inside the bound at m.
And so on...

Designing a data structure that works in O(logn) time

I'm auditing this algorithms class for work and I'm trying to do some practice problems given in class. This problem has me stumped and I just can't wrap my head around it. None of my solutions come out in O(logn) time. Can anyone help me with this problem??
Question:
Suppose that we are given a sequence of n values x1, x2, ... , xn in an arbitrary order and
seek to quickly answer repeated queries of the form: given an arbitrary pair i and j with
1 ≤ i < j ≤ n, find the smallest value in x1, ... , xj . Design a data structure that uses O(n) space and answers each query in O(log n) time.
For input of a1,a2,a3,...an , construct a node that contains minimum of (a1,..,ak) and minimum of (ak+1,..,an) where k = n/2.
Recursively construct the rest of the tree.
Now, if you want to find the minimum between ai and aj:
Identify the lowest common ancestor of i,j. Let it be k
Start with i and keep moving until you hit k. AT every iteration check if the child node was left node. If yes, then compare the right subtree's min and update current min accordingly.
Similarly, for j, check if it is right node....
At node k compare values returned by each subtree and return the min
People are overthinking this. Suppose that you start with the list:
47, 13, 55, 29, 56, 9, 17, 48, 69, 15
Make the following list of lists:
47, 13, 55, 29, 56, 9, 17, 48, 69, 15
13, 29, 9, 17, 15
13, 9, 15
9, 15
9
I leave the construction of these lists, correct usage, and proof that they provide an answer to the original question as exercises for the reader. (It might not be homework for you, but it could easily be for someone, and I don't like giving complete answers to homework questions.)
I think the crucial step is that you'll need to sort the data before hand. Then you can store the data in an array/list. Then you can run through a quick binary search in O(logn), picking out the first value that satisfies the condition (I'm assuming you meant between xi and xj, not x1 and xj).
edit: on second thought, ensuring that the value satisfies the condition may not be as trivial as I thought
The question was asked before in a slightly different way: What data structure using O(n) storage with O(log n) query time should I use for Range Minimum Queries?
Nevertheless, to quickly answer, the problem you're facing it's a well studied one - Range Minimum Query. A Segment Tree is a Data Structure that can solve the problem with O(N) space and O(logN) time requirements. You can see more details in here, where there's an explanation of the structure and the complexities involved.
Trying to explain the suggested data structure:
For every pair of numbers, calculate and keep the value of the smaller one.
For every four consecutive numbers, calculate and keep the value of the smallest of the four. This is done quickly by picking the smaller of the two pair values.
For every eight consecutive numbers, calculate and keep the value of the smallest of the eight.
And so on.
Let's say we want the smallest value of x19 to x65.
We look at the following stored values:
Smallest of x32 to x63.
Smallest of x24 to x31.
Smallest of x20 to x23.
x19.
Smallest of x64 to x65.
Then we pick the smallest of these.

Resources