Divide list into two equal parts algorithm

Related questions:
Algorithm to Divide a list of numbers into 2 equal sum lists
divide list in two parts that their sum closest to each other
Let's assume I have a list which contains exactly 2k elements. Now I want to split it into two parts, each of length k, while making the sums of the parts as equal as possible.
Quick example:
[3, 4, 4, 1, 2, 1] might be split into [1, 4, 3] and [1, 2, 4], and the sum difference will be 1
Now - if the parts can have arbitrary lengths, this is a variation of the Partition problem, and we know that it's weakly NP-complete.
But does the restriction of splitting the list into equal parts (let's say it's always k and 2k) make this problem solvable in polynomial time? Any proof of that (or a proof sketch for the fact that it's still NP-hard)?

It is still NP-complete. Proof by reduction of PP (your unrestricted variation of the Partition problem) to QPP (the equal-parts partition problem):
Take an arbitrary list of length k and add k extra elements, all valued zero.
We need to find the best performing partition in terms of PP. Let us find one using an algorithm for QPP and then forget about the k additional zero elements. Shifting zeroes around cannot affect the sums of this or any competing partition, so this is still one of the best performing unrestricted partitions of the arbitrary list of length k.
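A small Ruby sketch of that reduction. solve_qpp stands in for any equal-parts (QPP) solver; the brute-force version below is included only so the sketch runs, and nothing beyond the argument above is assumed.

# Brute-force equal-parts solver, included only to make the sketch runnable.
# Any QPP algorithm could be substituted here.
def solve_qpp(list)
  k = list.length / 2
  best = list.combination(k).min_by { |part| (2 * part.sum - list.sum).abs }
  rest = list.dup
  best.each { |x| rest.delete_at(rest.index(x)) }
  [best, rest]
end

# The reduction: pad a PP instance of length k with k zeros, solve the
# equal-parts problem, then strip exactly k zeros again. Moving zeros between
# the parts never changes either sum, so what remains is an optimal
# unrestricted partition of the original list.
def solve_pp_via_qpp(list)
  k = list.length
  part_a, part_b = solve_qpp(list + Array.new(k, 0))
  remaining = k
  strip = lambda { |part| part.reject { |x| x.zero? && (remaining -= 1) >= 0 } }
  [strip.call(part_a), strip.call(part_b)]
end

a, b = solve_pp_via_qpp([3, 4, 4, 1, 2, 1])
[a.sum, b.sum]   # => sums differing by 1, e.g. [8, 7]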

Related

Algorithm to sort X number in batches of Y

Could somebody direct me to an algorithm that I can use to sort X numbers in batches of Y? Meaning that you can only compare Y numbers at the same time, but you can do that multiple times.
E.g.
There are X=100 statements and a respondent must sort them according to how relevant they are to her in such a way that she will only see and sort Y=9 statements at a time, but will do that multiple times.
From your hypothetical, I believe you are willing to do a lot of work to figure out the next comparison set (because that is done by the computer), and would like as few comparisons as possible (because those are done by a human).
So the idea of the approach that I will outline is a greedy heuristic that attempts to maximize how much information each comparison gives us. It is complicated, but should do very well.
The first thing we need is a way to measure information. Here is the mathematical theory. Suppose that we have a biased coin with a probability p of coming up heads. The information in it coming up heads is -log2(p). The information in it coming up tails is -log2(1-p). (Note that the log of a number between 0 and 1 is negative, and the negative of a negative is positive. So information is always positive.) If you use an efficient encoding and have many flips to encode, the sum of the information of a sequence of flips is how many bits you need to send to communicate it.
The expected information of a single flip is therefore - p log2(p) - (1-p) log2(1-p).
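Here is that formula as a tiny Ruby helper (flip_entropy is just an illustrative name):

# Expected information (in bits) of one flip of a coin that lands heads with
# probability p; a certain outcome carries no information.
def flip_entropy(p)
  return 0.0 if p <= 0.0 || p >= 1.0
  -(p * Math.log2(p)) - ((1 - p) * Math.log2(1 - p))
end

flip_entropy(0.5)    # => 1.0 bit, the most informative comparison
flip_entropy(0.99)   # => about 0.08 bits, an almost-foregone conclusion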
So the idea is to pick a comparison set such that sorting it gives us as much information as possible about the final sort that we don't already have. But how do we estimate how much is not known about a particular pair? For example if I sort 2 groups of 5, the top of one group is unlikely to be less than the bottom of the other. It could be, but there is much less information in that comparison than comparing the two middle elements with each other. How do we capture that?
My idea for how to do that is to do a series of topological sorts to get a sense. In particular you do the first topological sort randomly. The second topological sort you try to make as different as possible by, at every choice, choosing the element which had the largest rank the last time. The third topological sort you choose the element whose sum of ranks in the previous sorts was as large as possible. And so on. Do this 20x or so.
Now for any pair of elements we can just look at how often they disagree in our sorts to estimate a probability that one is really larger than the other. We can turn that into an expected entropy with the formula from before.
So we start the comparison set with the element with the largest difference between its maximum and minimum rank in the sorts.
The second element is the one that has the highest entropy with the first, breaking ties by the largest difference between its minimum and maximum rank in the sorts.
The third is the one whose sum of entropies with the first two is the most, again breaking ties in the same way.
The exact logic that the algorithm will follow is, of course, randomized. In fact you're doing O(k^2 n) work per comparison set that you find. But it will on average finish with surprisingly few comparison sets.
I don't have a proof, but I suspect that you will on average only need the theoretically optimal O(log(n!) / log(k!)) = O(n log(n) / (k log(k))) comparisons. For k=2 my further suspicion is that it will give a solution that is on average more efficient than merge sort.
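A rough Ruby sketch of the entropy-guided selection step described above. It assumes the candidate orderings (the ~20 randomized topological sorts) already exist as permutations of the item indices; generating them from the comparisons made so far is not shown.

# rankings: array of candidate orderings, each a permutation of 0...n
# n: number of items, k: size of the comparison set to propose
def pick_comparison_set(rankings, n, k)
  # positions[s][i] = rank of item i in candidate ordering s
  positions = rankings.map do |order|
    pos = Array.new(n)
    order.each_with_index { |item, r| pos[item] = r }
    pos
  end
  # spread of each item's rank across the candidate orderings
  spread = (0...n).map { |i| positions.map { |p| p[i] }.max - positions.map { |p| p[i] }.min }

  # expected information of comparing items i and j, estimated from how often
  # the candidate orderings disagree about their relative order
  pair_entropy = lambda do |i, j|
    p = positions.count { |pos| pos[i] > pos[j] }.to_f / positions.size
    next 0.0 if p <= 0.0 || p >= 1.0
    -(p * Math.log2(p)) - ((1 - p) * Math.log2(1 - p))
  end

  chosen = [spread.each_with_index.max.last]      # largest min-to-max rank spread
  while chosen.size < k
    best = ((0...n).to_a - chosen).max_by do |i|
      [chosen.sum { |j| pair_entropy.call(i, j) }, spread[i]]   # entropy, ties by spread
    end
    chosen << best
  end
  chosen
end

# e.g. pick_comparison_set(rankings, 100, 9) proposes the next 9 statements to sort.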
At each round, you'll sort floor(X/Y) batches of Y elements and one batch of X mod Y elements.
Suppose for simplicity that the input is given as an array A[1...X].
At the first round, the batches will be A[1...Y], A[Y+1...2Y], ..., A[(floor(X/Y)-1)Y+1...floor(X/Y)Y], A[floor(X/Y)Y+1...X].
For the second round, shift these ranges right by Y/2 places (you can use wrap-around if you like, though for simplicity I will simply assume the first Y/2 elements are left alone in even-numbered iterations). So the ranges could be A[Y/2+1...3Y/2], A[3Y/2+1...5Y/2], etc. The next round repeats the ranges of the first, the round after that repeats the ranges of the second, and so on. How many iterations are needed in the worst case to guarantee a fully sorted list? Well, in the worst case the maximum element must migrate from the beginning to the end, and since it takes two iterations for an element to migrate across one full odd-iteration section (see below), it stands to reason that 2*ceiling(X/Y) iterations in total suffice for an element at the front to get to the end.
Example:
X=11
Y=3
A = [7, 2, 4, 5, 2, 1, 6, 2, 3, 5, 6]
[7,2,4] [5,2,1] [6,2,3] [5,6] => [2,4,7] [1,2,5] [2,3,6] [5,6]
2 [4,7,1] [2,5,2] [3,6,5] [6] => 2 [1,4,7] [2,2,5] [3,5,6] [6]
[2,1,4] [7,2,2] [5,3,5] [6,6] => [1,2,4] [2,2,7] [3,5,5] [6,6]
1 [2,4,2] [2,7,3] [5,5,6] [6] => 1 [2,2,4] [2,3,7] [5,5,6] [6]
[1,2,2] [4,2,3] [7,5,5] [6,6] => [1,2,2] [2,3,4] [5,5,7] [6,6]
1 [2,2,2] [3,4,5] [5,7,6] [6] => 1 [2,2,2] [3,4,5] [5,6,7] [6]
[1,2,2] [2,3,4] [5,5,6] [7,6] => [1,2,2] [2,3,4] [5,5,6] [6,7]
1 [2,2,2] [3,4,5] [5,6,6] [7] => no change, termination condition
This might seem a little silly, but if you have an efficient way to sort small groups and a lot of parallelism available this could be pretty nifty.
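A Ruby sketch of these alternating rounds, assuming Y >= 2. Plain Array#sort stands in for whatever actually sorts a batch of up to Y elements, and instead of counting off 2*ceiling(X/Y) rounds it stops when a full pair of passes changes nothing, as in the termination condition of the example.

# Sort `a` in place using windows of size y, alternating between the
# unshifted and the shifted tiling until neither pass changes anything.
def batch_sort!(a, y)
  shift = y / 2
  loop do
    changed = false
    [0, shift].each do |offset|
      # leave the first `offset` elements alone on the shifted pass
      (offset...a.length).step(y) do |lo|
        hi = [lo + y, a.length].min
        sorted = a[lo...hi].sort
        changed ||= sorted != a[lo...hi]
        a[lo...hi] = sorted
      end
    end
    break unless changed      # no change over both passes => fully sorted
  end
  a
end

batch_sort!([7, 2, 4, 5, 2, 1, 6, 2, 3, 5, 6], 3)
# => [1, 2, 2, 2, 3, 4, 5, 5, 6, 6, 7]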

Minimal number of cuts to partition sequence into pieces that can form a non-decreasing sequence

I have N integers, for example 3, 1, 4, 5, 2, 8, 7; there may be some duplicates. I want to divide this sequence into contiguous pieces such that the pieces can be rearranged to form a non-decreasing sequence. How do I calculate the minimal number of cuts? For the example above the answer is 6, because we can partition the sequence into {3}, {1}, {4, 5}, {2}, {7}, {8} and then form {1, 2, 3, 4, 5, 7, 8}. What is the fastest way to do this?
Does anyone know how to solve it assuming that some numbers may be equal?
I would cut the array into non-decreasing segments at points where the values decrease, and then use these segments as input into a (single) merge phase - as in a sort-merge - keeping with the same segment, where possible, in the case of ties. Create additional locations for cuts when you have to switch from one segment to another.
The output is sorted, so this produces enough cuts to do the job. Cuts are produced at points where the sequence decreases, or at points where a gap must be created because the original sequence jumps across a number present elsewhere - so no sequence without all of these cuts can be rearranged into sorted order.
Worst case for the merge overhead is if the initial sequence is decreasing. If you use a heap to keep track of what sequences to pick next then this turns into heapsort with cost n log n. Handle ties by pulling all occurrences of the same value from the heap and only then deciding what to do.
This approach works if the list does not contain duplicates. Perhaps those could be taken care of efficiently first.
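A Ruby sketch of the split-and-merge idea described above: cut into non-decreasing runs, merge them while preferring to stay in the current run on ties, and count how many pieces the output uses. A heap over the run heads gives the n log n bound; a plain linear scan keeps the sketch short, and ties are handled in the simple stay-with-current way rather than by pulling all equal values first.

# Counts the pieces the merge ends up using; every switch to a different run
# starts a new piece.
def count_pieces(a)
  return 0 if a.empty?
  runs = a.slice_when { |x, y| y < x }.to_a   # maximal non-decreasing segments
  heads = Array.new(runs.size, 0)             # next unread position in each run
  current = nil
  pieces = 0
  a.length.times do
    live = (0...runs.size).select { |r| heads[r] < runs[r].size }
    best = live.min_by { |r| runs[r][heads[r]] }
    # stay with the current run if its head ties the minimum value
    if current && heads[current] < runs[current].size &&
       runs[current][heads[current]] == runs[best][heads[best]]
      best = current
    end
    pieces += 1 if best != current
    current = best
    heads[best] += 1
  end
  pieces
end

count_pieces([3, 1, 4, 5, 2, 8, 7])   # => 6, matching the question's example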
We can compute the permutation inversion vector in O(n * log n) time and O(n) space using a Fenwick tree. Continuous segments of the vector with the same number can represent sections that need not be cut. Unfortunately, this can also return false positives, for example:
Array: {8,1,4,5,7,6,3}
Vector: 0,1,1,1,1,2,5
where the 6 and 3 imply cuts inside the run [1,4,5,7]. To counter this, we take a second inversion vector representing the number of smaller elements following each element. Segments that are continuous and parallel in both vectors need not be cut:
Array: {8,1,4,5,7,6,3,0}
Vector: 0,1,1,1,1,2,5,7 // # larger preceding
Vector: 7,1,2,2,3,2,1,0 // # smaller following
// parallel segment under 4,5 => {4,5} need not be cut
Array: {3,1,4,5,2,8,7}
Vectors: 0,1,0,0,3,0,1
2,0,1,1,0,1,0
// parallel segment under 4,5 => {4,5} need not be cut
Array: {3,1,2,4}
Vectors: 0,1,1,0
2,0,0,0
// parallel segment under 1,2 => {1,2} need not be cut
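A Ruby sketch of computing those two vectors with a Fenwick tree over the ranks of the values (values are coordinate-compressed, so arbitrary integers work). It reproduces the example vectors above; turning the parallel segments into a final cut count is not shown.

# Fenwick (binary indexed) tree over 1-based positions.
class Fenwick
  def initialize(n)
    @tree = Array.new(n + 1, 0)
  end

  def add(i, v = 1)              # add v at position i
    while i < @tree.size
      @tree[i] += v
      i += i & -i
    end
  end

  def sum(i)                     # sum of positions 1..i
    s = 0
    while i > 0
      s += @tree[i]
      i -= i & -i
    end
    s
  end
end

def inversion_vectors(a)
  ranks = a.sort.uniq.each_with_index.to_h       # value => 0-based rank
  m = ranks.size

  larger_preceding = []
  bit = Fenwick.new(m)
  a.each do |x|
    r = ranks[x] + 1
    larger_preceding << bit.sum(m) - bit.sum(r)  # strictly larger values seen before x
    bit.add(r)
  end

  smaller_following = []
  bit = Fenwick.new(m)
  a.reverse_each do |x|
    r = ranks[x] + 1
    smaller_following << bit.sum(r - 1)          # strictly smaller values seen after x
    bit.add(r)
  end

  [larger_preceding, smaller_following.reverse]
end

inversion_vectors([8, 1, 4, 5, 7, 6, 3, 0])
# => [[0, 1, 1, 1, 1, 2, 5, 7], [7, 1, 2, 2, 3, 2, 1, 0]]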

Algorithm to find first sequence of integers that sum to certain value

I have a list of numbers and I have a sum value. For instance,
list = [1, 2, 3, 5, 7, 11, 10, 23, 24, 54, 79 ]
sum = 20
I would like to generate a sequence of numbers taken from that list, such that the sequence sums up to that target. In order to help achieve this, the sequence can be of any length and repetition is allowed.
result = [2, 3, 5, 10], or result = [1, 1, 2, 3, 3, 5, 5], or result = [10, 10]
I've been doing a lot of research into this problem and have found the subset sum problem to be of interest. My problem is, in a few ways, similar to the subset sum problem in that I would like to find a subset of numbers that produces the targeted sum.
However, unlike the subset sum problem which finds all sets of numbers that sum up to the target (and so runs in exponential time if brute forcing), I only want to find one set of numbers. I want to find the first set that gives me the sum. So, in a certain sense, speed is a factor.
Additionally, I would like there to be some degree of randomness (or pseudo-randomness) to the algorithm. That is, should I run the algorithm using the same list and sum multiple times, I should get a different set of numbers each time.
What would be the best algorithm to achieve this?
Additional Notes:
What I've achieved so far is a naive method where I cycle through the list, adding each element to every combination of values so far. This obviously takes a long time and I'm currently not feeling too happy about it. I'm hoping there is a better way to do this!
If there is no sequence that gives me the exact sum, I'm satisfied with a sequence that gives me a sum that is as close as possible to the targeted sum.
As others said, this is an NP-hard problem.
However, this doesn't mean small improvements aren't possible:
Is 1 in the list? [1,1,1,1...] is the solution. O(1) in a sorted list
Remove list elements bigger than the target sum. O(n)
Is there any list element x with (sum % x) == 0? Again, an easy solution (just repeat x). O(n)
Are there any list elements x,y with (x%y)==0 ? Remove x. O(n^2)
(maybe even: are there any list elements x, y, z with (x % y) == z? Remove x. Or with (x + y) == z? Remove z. O(n^3))
Before using the full recursion, try whether you can get the sum just with the smallest even and the smallest odd number.
...
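A Ruby sketch of a few of the quick checks above, meant to run before any exhaustive search. quick_checks is just an illustrative helper; it returns a solution array when a shortcut applies, or nil so the caller can fall back to the full search.

def quick_checks(list, target)
  list = list.select { |x| x > 0 && x <= target }.sort   # drop elements larger than the target
  return nil if list.empty?
  return [1] * target if list.first == 1                 # 1 is in the list
  divisor = list.find { |x| target % x == 0 }            # target is a multiple of some element
  return [divisor] * (target / divisor) if divisor
  nil
end

quick_checks([3, 7, 11, 10], 20)   # => [10, 10]
quick_checks([3, 7, 11], 20)       # => nil, fall back to the full search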
The Subset Sum problem isn't about finding all subsets, but rather about determining whether there is some subset. It is a decision problem. All problems in NP are like this. And even this simpler problem is NP-complete.
This means that if you want an exact answer (the subset must sum to exactly some value), you won't be able to do much better than a generic subset sum algorithm (it is exponential unless P=NP).
I would attempt to reduce the problem to a brute-force search of a smaller set.
Sort the list smallest to largest.
Keep a running sum and a result list.
Repeat {
Draw randomly from the subset of the list whose elements are no greater than target - sum.
Increment sum by the drawn value and add the drawn value to the result list.
} until no such element exists or sum == target
If sum != target, brute force search for small combinations from the list that make up the remaining difference, possibly together with removing small combinations from the result.
This approach may fail to find valid solutions, even if they exist. It can, however, quickly find a solution or quickly fail before having to resort to a slower brute force approach using the entire set at a greater depth.
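A Ruby sketch of the randomized greedy draw above. Repetition is allowed, so any element that still fits may be drawn again; the brute-force cleanup of a leftover difference is omitted, and the remaining gap is simply reported.

def random_greedy(list, target)
  pool = list.select { |x| x > 0 && x <= target }.sort
  result = []
  sum = 0
  loop do
    candidates = pool.take_while { |x| x <= target - sum }
    break if candidates.empty? || sum == target
    pick = candidates.sample
    result << pick
    sum += pick
  end
  [result, target - sum]   # [drawn values, remaining difference (0 if exact)]
end

random_greedy([1, 2, 3, 5, 7, 11, 10, 23, 24, 54, 79], 20)
# => e.g. [[7, 10, 3], 0] -- a different sequence each run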
This is a greedy approach to the problem:
Without 'randomness':
Obtain the single largest number in the set that is smaller than your desired sum - we'll name it X. Since the set is ordered, this is O(1) at best and O(N) at worst (for instance if the sum is 2).
Since you can repeat the value - say c times - do so as many times as you can without overshooting the sum, but be careful! You have essentially created a new, smaller target: you'll now be finding numbers that add up to R = sum - X * c. So find the largest number smaller than R, and check whether R minus that number is 0, or whether the remainder is divisible by any of the smaller numbers.
If R > 0 remains, make partial sums of the smaller numbers less than R (this will not be more than 5~10 computations because of the nature of this algorithm) and see whether these satisfy it.
If that step makes R < 0, remove one X and start the process again.
With 'randomness':
Just get X randomly! :-)
Note: This would work best if you have a few single digit numbers.
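A compact Ruby sketch of this largest-first greedy. With random: false it takes the largest element that still fits and repeats it as often as possible; with random: true it simply picks X at random, as suggested. The finer partial-sum repair steps are left out; the leftover R is just returned.

def greedy_fill(list, target, random: false)
  pool = list.select { |x| x > 0 && x <= target }.sort.reverse
  result = []
  r = target
  until pool.empty? || r.zero?
    fits = pool.select { |x| x <= r }
    break if fits.empty?
    x = random ? fits.sample : fits.first
    count = r / x                    # repeat X as many times as possible
    result.concat([x] * count)
    r -= x * count
    pool -= [x]                      # move on to smaller values for the new R
  end
  [result, r]
end

greedy_fill([1, 2, 3, 5, 7, 11, 10, 23], 20)                 # => [[11, 7, 2], 0]
greedy_fill([1, 2, 3, 5, 7, 11, 10, 23], 20, random: true)   # result varies between runs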

Check for duplicate subsequences of length >= N in sequence

I have a sequence of values and I want to know whether it contains any repeated subsequences of a certain minimum length. For instance:
1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101
Contains the subsequence 3, 4, 5, 100 twice. It also contains the subsequence 99, 101 twice, but that subsequence is too short to care about.
Is there an efficient algorithm for checking the existence of such a subsequence? I'm not especially interested in locating the subsequences (though that would be helpful for verification); I'm primarily just interested in a True/False answer, given a sequence and a minimum subsequence length.
My only approach so far is to brute force search it: for each item in the sequence, find all the other locations where the item occurs (already at O(N^2)), and then walk forward one step at a time from each location and see if the next item matches, and keep going until I find a mismatch or find a matching subsequence of sufficient length.
Another thought I had, but haven't been able to develop into an actual approach, is to build a tree of all the sequences, so that each number is a node and a child of the number that preceded it, wherever that node happens to already be in the tree.
There are O(k) solutions (where k is the length of the whole sequence) for any value of N.
Solution #1: Build a suffix tree for the input sequence (using Ukkonen's algorithm). Iterate over the nodes with two or more children and check whether at least one of them has depth >= N.
Solution #2: Build a suffix automaton for the input sequence. Iterate over all the states whose right context contains at least two different strings and check whether at least one of those states has distance >= N from the initial state of the automaton.
Solution #3: A suffix array and the longest-common-prefix technique can also be used (build the suffix array for the input sequence, compute the longest common prefix array, and check that there is a pair of adjacent suffixes with a common prefix of length at least N).
These solutions have O(k) time complexity under the assumption that the alphabet size is constant (the alphabet consists of all elements of the input sequence).
If that is not the case, it is still possible to obtain O(k log k) worst-case time complexity (by storing the transitions of the tree or automaton in a map), or O(k) on average using a hash map.
P.S. I use the terms string and sequence interchangeably here.
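A Ruby sketch along the lines of solution #3. For brevity it uses the simple doubling construction for the suffix array (O(k log^2 k), not the linear-time structures mentioned above) and Kasai's algorithm for the LCP array; a repeat of length >= N exists exactly when some pair of adjacent suffixes shares a prefix of length >= N. Arbitrary integer "letters" are fine, since only comparisons are used.

def suffix_array(s)
  n = s.length
  comp = s.sort.uniq.each_with_index.to_h          # value => initial rank
  rank = s.map { |x| comp[x] }
  sa = (0...n).to_a
  k = 1
  while k < n
    key = ->(i) { [rank[i], i + k < n ? rank[i + k] : -1] }
    sa.sort_by! { |i| key.call(i) }
    new_rank = Array.new(n, 0)
    (1...n).each do |idx|
      same = key.call(sa[idx]) == key.call(sa[idx - 1])
      new_rank[sa[idx]] = new_rank[sa[idx - 1]] + (same ? 0 : 1)
    end
    rank = new_rank
    break if rank[sa[-1]] == n - 1                 # all ranks distinct: done
    k *= 2
  end
  sa
end

# Kasai's algorithm: lcp[i] = common prefix length of suffixes sa[i-1] and sa[i]
def lcp_array(s, sa)
  n = s.length
  pos = Array.new(n)
  sa.each_with_index { |suffix, i| pos[suffix] = i }
  lcp = Array.new(n, 0)
  h = 0
  (0...n).each do |i|
    if pos[i] > 0
      j = sa[pos[i] - 1]
      h += 1 while i + h < n && j + h < n && s[i + h] == s[j + h]
      lcp[pos[i]] = h
      h -= 1 if h > 0
    else
      h = 0
    end
  end
  lcp
end

def repeated_subsequence?(seq, min_len)
  lcp_array(seq, suffix_array(seq)).max.to_i >= min_len
end

seq = [1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101]
repeated_subsequence?(seq, 3)   # => true  (3, 4, 5, 100 repeats)
repeated_subsequence?(seq, 5)   # => false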
If you only care about subsequences of length exactly N (for example, if you just want to check that there are no duplicates), then there is a quadratic solution: use the KMP algorithm for every subsequence.
Let's assume that there are k elements in the whole sequence.
For every subsequence of length N (O(k) of them):
Build its failure function (takes O(N))
Search for it in the remainder of the sequence (takes O(k))
So, assuming N << k, the whole algorithm is indeed O(k^2).
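A Ruby sketch of this per-window KMP check. Note that a repeat of length >= N exists if and only if some window of length exactly N repeats, so this also answers the original >= N question.

# For each window of length n, build its KMP failure function and search for
# another occurrence of it anywhere else in the sequence.
def kmp_find_elsewhere?(pattern, text, own_start)
  failure = Array.new(pattern.length, 0)
  k = 0
  (1...pattern.length).each do |i|
    k = failure[k - 1] while k > 0 && pattern[i] != pattern[k]
    k += 1 if pattern[i] == pattern[k]
    failure[i] = k
  end
  k = 0
  text.each_with_index do |c, i|
    k = failure[k - 1] while k > 0 && c != pattern[k]
    k += 1 if c == pattern[k]
    next unless k == pattern.length
    return true if i - pattern.length + 1 != own_start   # a second occurrence
    k = failure[k - 1]                                   # skip the window's own position
  end
  false
end

def duplicate_of_length?(seq, n)
  (0..seq.length - n).any? { |i| kmp_find_elsewhere?(seq[i, n], seq, i) }
end

seq = [1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101]
duplicate_of_length?(seq, 4)   # => true  (3, 4, 5, 100)
duplicate_of_length?(seq, 5)   # => false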
Since your list is unordered, you're going to have to visit every item at least once.
What I'm thinking is that you first go through your list and create a dictionary where you store each number as a key along with all the indices at which it appears in your sequence. Like:
Key: Indices
1: 0
2: 1
3: 2, 8
....
Where the number 1 appears at index 0, the number 2 appears at index 1, the number 3 appears at indices 2 and 8, and so on.
With that created, you can then go through the dictionary keys and compare the subsequences starting at each of those indices against one another. This should save some of the brute force, since you don't have to re-scan the whole initial sequence for each number.

finding similarity of sequences in ruby

I want to find the similarity of two sequences in Ruby based solely on the quantity of shared values. The sequential position of the values should be irrelevant. What should also be irrelevant is whether one sequence has any values that the other sequence does not have. Levenshtein distance was suggested to me, but it computes the number of edits required to make the sequences identical. Here's a simple example of where the flaw lies:
[1,2,3,4,5]
[2,3,4,5,6,7,8,9]
#Lev distance is 5
[1,2,3,4,5]
[6,7,8,9,10]
#Lev distance is 5
In a perfect world the first set would have much greater similarity than the second set. The crude, obvious solution is to use nested loops to check each value of the first sequence against each value of the second. Is there a more efficient way?
You can do an intersection for a pair of arrays using an & like this:
a = [1,2,3,4,5]
b = [2,3,4,5,6,7,8,9]
common = a & b # => [2, 3, 4, 5]
common.size # => 4
Is this what you are looking for?
If the sequences are sorted (or you sort them), all you have to do is walk down both lists, incrementing the similarity counter and popping off both values if they match. If they don't match, you pop off the smaller value, and continue until one list is empty. The complexity of this is O(n log n) for the sorting plus O(n) for the walk, where n is the sum of the lengths of the two lists.
You could also loop through each list, counting the incidence of each number (so you end up with a list of the counts of each value). Then you could compare these quantities, incrementing the similarity counter by the lesser count for each value.
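Small Ruby sketches of both ideas, treating "similarity" as the size of the multiset intersection (so duplicates are counted, unlike a & b). Array#tally needs Ruby 2.7 or newer.

# Sorted two-pointer walk: O(n log n) for the sorts plus O(n) for the walk.
def similarity_sorted(a, b)
  x, y = a.sort, b.sort
  i = j = count = 0
  while i < x.size && j < y.size
    if x[i] == y[j]
      count += 1
      i += 1
      j += 1
    elsif x[i] < y[j]
      i += 1
    else
      j += 1
    end
  end
  count
end

# Counting approach: tally each list, then take the lesser count per value.
def similarity_counts(a, b)
  ca, cb = a.tally, b.tally
  ca.sum { |value, n| [n, cb.fetch(value, 0)].min }
end

similarity_sorted([1, 2, 3, 4, 5], [2, 3, 4, 5, 6, 7, 8, 9])   # => 4
similarity_counts([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])           # => 0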
