What is a sorting algorithm that is robust to a faulty comparison? - algorithm

I want to sort a list of n items with a comparison sort. However, one of the comparisons made by the algorithm will be flipped from what it's supposed to be. Specifically, there is one pair of items for which the comparator function consistently gives the wrong result.
What is an efficient n*log(n) sorting algorithm that will be robust to this faulty comparison? By robust, I mean that every item is off by at most k spots from its true position, for some reasonably small k.
If possible, I'd like it to be robust in the worst case (faulty comparison chosen adversarially), but I'll settle for robust in the average case.
An example of a robust algorithm (that's not efficient) would be to make all n*(n-1)/2 pairwise comparisons, and place each item by how many of the comparisons it won. Then, no matter which comparison the adversary flips, each item's index will be off by no more than k=1.
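The round-robin idea above can be sketched in Python as follows. This is only an illustration under assumptions: `round_robin_sort` and `make_faulty_less` are hypothetical names, the items are distinct integers, and the faulty comparator is simulated by flipping the answer for one fixed pair.

```python
def make_faulty_less(x, y):
    """Hypothetical helper: a 'less than' that lies only about the pair {x, y}."""
    def less(a, b):
        if {a, b} == {x, y}:
            return not (a < b)   # the single consistently wrong answer
        return a < b
    return less


def round_robin_sort(items, less):
    """Order items by how many of the n*(n-1)/2 pairwise comparisons they win.

    An item 'wins' a comparison when the comparator says it is the smaller one,
    so the item with the most wins goes first. Ties keep their input order.
    """
    items = list(items)
    wins = [
        sum(1 for j, b in enumerate(items) if j != i and less(a, b))
        for i, a in enumerate(items)
    ]
    order = sorted(range(len(items)), key=lambda i: -wins[i])
    return [items[i] for i in order]
```

Flipping one comparison changes the win count of exactly two items by one each, which is why no item should end up more than one position away from its true index.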
An example of a NON-robust algorithm is quicksort, because the adversary could just choose the largest item to be on the wrong side of the first pivot, making it on average n/2 spots off from its correct index.

TL;DR: It's possible to modify quicksort to get the following guarantee: in (expected) time O(n log n), we can do one of the following, depending on which comparison is flipped.
Perfectly sort the array.
Perfectly sort the array, except that an adjacent pair of items somewhere in the array is swapped.
Perfectly sort the array, except that three consecutive items in the array, which can be identified, are permuted.
This guarantees a maximum displacement of 2, which is as good as is theoretically possible.
I mulled over this problem for a couple of hours and everything I'm doing connects back to tournaments.
I'd like to begin by trying to reframe the question as follows. If you have a set of n items and you know the "true" results of the comparisons between them, you can represent that result as a directed graph with one node per item and edges indicating when one item compares less than another. This type of digraph is called a "tournament," since you can think of it as encoding the result of a round-robin tournament where each player plays each other player.
In the case of an honest comparator, our tournament will be acyclic, and in particular it will have the following key property: there's exactly one node of each outdegree 0, 1, 2, ..., n - 1. The idea here is that the smallest element will have outdegree n - 1 (it's smaller than everything else), while the largest element will have outdegree 0 (it's bigger than everything else). And in fact, there's a theorem that a tournament is acyclic if and only if each node in the tournament has a different outdegree. Another useful fact: in an acyclic tournament, there's an edge from U to V if and only if outdeg(U) > outdeg(V).
In the case of a "dishonest comparator," we essentially start with an acyclic tournament, then flip a single edge. Your question asked about doing approximate sorting based on this comparator, but I'd like to step back and ask a different question, which I think can then be used to answer yours more precisely. In what cases can you figure out which edge was flipped? If we can do that, then we can do even better than approximate sorting - we can "unflip" the edge and sort perfectly. On the other hand, in which cases can you not figure out which edge was flipped, and when that happens, how far from sorted will we end up? That corresponds to having to do an approximate sort because we can't recover the original ordering.
Here's a useful fact:
Theorem: Begin with an acyclic tournament and flip a single edge. Then it's possible to determine which edge was flipped if and only if the outdegrees of the two endpoints of the flipped edge originally differ by at least three.
To prove this, we'll show both directions of implication.
First, suppose that we flip an edge between two nodes X and Y whose outdegrees differ by one. When we're done, we're left with a tournament where all nodes have different outdegrees (all other nodes have their outdegrees unchanged, and if we flipped the edge (X, Y), then X and Y swap outdegrees because one goes up by one and one goes down by one). We're now left with another acyclic tournament. And in particular, we can't tell which edge we flipped, because we could have just as well flipped any edge between any pair of nodes whose outdegrees differ by one. That means that any sorted ordering we get from this comparator will have those two items out of place, so we'd have a maximum distance of at least 1.
Next, suppose we flip an edge between nodes X and Y where outdeg(X) = k+1 and outdeg(Y) = k-1. We now have outdeg(X) = k = outdeg(Y), and somewhere else to begin with there must have been some node Z with outdegree k as well. So at this point, we have three nodes of outdegree k (namely, X, Y, and Z), and we know that we must have flipped one of the three edges between them. But we can't tell which one it was. Specifically, flipping the XY edge, or the XZ edge, or the YZ edge would all give back acyclic tournaments. So in that case, there's no way to undo the transform.
An important note for this particular case: this corresponds to the comparator creating a tournament with exactly one cycle containing the nodes X, Y, and Z. Specifically, it'll take on the form X, Z, Y, X. The problem is we can't tell whether the original ordering was (X, Z, Y), or (Z, Y, X), or (Y, X, Z), and so we'd have a maximum distance of at least 2.
And finally, suppose that we have two nodes X and Y and flip the edge XY in the case where outdeg(X) = k, outdeg(Y) = m, and k ≥ m + 3. We're now left with a tournament in which two nodes have outdegree k - 1 and two nodes have outdegree m + 1. But of those four nodes, it's guaranteed that there's exactly one pair of them that can be flipped back to produce an acyclic tournament. One way to see this: take the four nodes that now have repeated outdegrees; call them X and Y (as above) and also W and Z, and suppose we have the cycle X, W, Z, Y, X, where the only flipped edge from the original is (Y, X). What will this cycle look like? Well, since (X, W), (W, Z), and (Z, Y) are edges in the tournament that weren't flipped, back in the original tournament we have outdeg(X) > outdeg(W) > outdeg(Z) > outdeg(Y). That means that we have to have X and W having outdegree k - 1 in the new graph and Z and Y having outdegree m + 1 in the new graph. Therefore, only flipping the edge from Y to X will increase the degree of one of the degree-(k-1) nodes back up to k while also decreasing the degree of one of the degree-(m+1) nodes down to m.
Theorem: The faulty comparator will either
Behave as a real comparator, in which case we swapped two adjacent elements in the original sequence and we will never know which.
Have exactly one cycle of length three of elements whose original ordering can never be known, or
Have a cycle of length four, in which case we can identify which comparison is reversed.
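To make the three cases concrete, here is a brute-force sketch that classifies a comparator by its outdegree sequence. It uses all O(n^2) comparisons, so it only illustrates the theorem and is not the O(n log n) algorithm discussed in this answer. The names `diagnose_comparator` and `make_faulty_less` are hypothetical, and the items are assumed to be distinct and hashable.

```python
from collections import defaultdict


def make_faulty_less(x, y):
    """Hypothetical helper: a 'less than' that lies only about the pair {x, y}."""
    def less(a, b):
        if {a, b} == {x, y}:
            return not (a < b)
        return a < b
    return less


def diagnose_comparator(items, less):
    """Classify a comparator with at most one flipped pair.

    outdeg[a] counts the items that a compares less than (edge a -> b).
    Returns ("consistent", None), ("3-cycle", three_items), or
    ("flipped", (a, b)) where the true order is a before b.
    """
    items = list(items)
    outdeg = {a: sum(1 for b in items if b != a and less(a, b)) for a in items}
    by_degree = defaultdict(list)
    for item, degree in outdeg.items():
        by_degree[degree].append(item)
    repeated = sorted((d, g) for d, g in by_degree.items() if len(g) > 1)

    if not repeated:
        return ("consistent", None)          # looks like an honest comparator
    if len(repeated) == 1 and len(repeated[0][1]) == 3:
        return ("3-cycle", tuple(repeated[0][1]))
    if len(repeated) == 2 and all(len(g) == 2 for _, g in repeated):
        (_, low), (_, high) = repeated       # sorted by outdegree, ascending
        for b in low:
            for a in high:
                # Only the flipped edge points from a lower-outdegree node
                # to a higher-outdegree node.
                if less(b, a):
                    return ("flipped", (a, b))
    raise ValueError("comparator does not look like a single flipped edge")
```

For example, with the items 0..9 and the pair (2, 7) flipped, items 2 and 3 share one outdegree and items 6 and 7 share another, and the only "backwards" edge among them identifies the pair.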
With this in mind, it seems reasonable to reframe your problem in the following way:
Goal: Design an algorithm that, in time O(n log n), does one of the following things to a list of n elements given a faulty comparator that returns the wrong result when comparing two fixed elements X and Y against one another:
Perfectly sort the list.
Perfectly sort the list, except with two adjacent items swapped.
Perfectly sort the list, except with three adjacent items permuted.
Here's one possible algorithm that does this in expected O(n log n) time that's based on quicksort. The basic idea is the following: we run more or less a regular quicksort, at each point in time checking to see whether we found a triangle. If not, then either we're in case (1) or case (2). If we do find a triangle, we see whether we can identify which comparison got reversed. If we can, then we rerun quicksort, except that we "fix" the comparator in this broken case. If we can't, then we're in case (3) and just finish quicksort as usual.
The specific technique we'll use to detect a triangle works like this. Begin with a regular, vanilla quicksort: pick a pivot, partition the array into things less than the pivot and things bigger than the pivot, then recursively sort the two smaller subarrays. However, after doing so, we do one additional step: assuming the subarray we're sorting has three or more elements in it, look at the pivot p and the element just before and just after it (call those s, p, g for "smaller," "pivot," and "greater"). Then if the comparator says s < p < g < s, we've found a triangle. And in fact, we have something stronger.
Suppose that at some point in quicksort the comparator does indeed compare X and Y, the mismatched items. We're assuming X < Y, but that the comparator incorrectly reports that Y < X. The only way that two items can be compared in quicksort is if one of them is a pivot element at a time when the other is in the current subarray. Without loss of generality, let's assume that X was the pivot, and that Y was compared against it.
What should happen here, assuming the comparator was honest, is that Y would be found to be larger than X, and therefore would be placed into the "bigger" subarray. But because the comparator is a lying liar who lies, instead Y gets placed into the "smaller" subarray. If we then recursively sort the "smaller" subarray and the "bigger" subarray, think about where Y will end up. It's in the "smaller" subarray but is actually bigger than X, which means it'll compare larger than everything in that "smaller" subarray. Consequently, Y will appear just before X. Now, look at the items in the "bigger" subarray. There are two options. The first is that in the "real" ordering, there's at least one value between X and Y. That value would then appear in the "bigger" subarray because it's larger than X, and in particular the first element of the "bigger" subarray would compare smaller than Y. That would mean that Y, then X, then the item immediately after X after sorting would form a triangle. The other option is that X and Y are adjacent in the true sorted ordering, in which case we'd never find out (as mentioned above). This, combined with the above insight, means that
Theorem: Suppose we run quicksort, and after recursively sorting the left and right subarrays we look at the three items consisting of the pivot, the item just before it, and the item just after it to see if they form a triangle. Then if this algorithm detects a triangle, a triangle exists. Moreover, if this algorithm does not detect a triangle, then either (1) no triangle exists or (2) a triangle does exist, but the comparator was never applied to the bad pair (X, Y) and so the sorted order is correct.
With all this said and done, we can state the full algorithm that, in expected O(n log n) time, sorts the array as best as is possible.
function modifiedQuicksort(array, comparator):
    if array has length 0 or 1, return.
    pick a random pivot element from the array.
    use the comparator to form subarrays smaller and greater based on
        how elements compare against the pivot.
    recursively apply modifiedQuicksort to those two arrays.
    if the comparator finds a triangle formed from the last element of
        smaller, the pivot, and the first element of greater, report those
        three items as a triangle.
    return smaller, pivot, greater.

function sortAsBestWeCan(array, comparator):
    run modifiedQuicksort(array, comparator)
    if it didn't report a triangle, return the result of the call.
    otherwise, it reported a triangle A, B, C.
    for each other item D:
        if comparator(A, D) and comparator(D, B) or
           comparator(B, D) and comparator(D, C) or
           comparator(C, D) and comparator(D, A):
            you have found a 4-cycle from A, B, C, and D.
            detect which comparison is reversed.
            use that knowledge plus the comparator and your favorite
                O(n log n)-time sorting algorithm to perfectly sort
                the input array.
    otherwise, those three items are the only triangle, and the
        array is sorted as well as it can be. return it.
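As a rough illustration of the first routine, here is a Python sketch of a quicksort that reports a triangle around the pivot. It covers only the detection step, not the 4-cycle repair. The names `modified_quicksort` and `make_faulty_less` are hypothetical, the items are assumed to be distinct, and the function builds new lists rather than sorting in place.

```python
import random


def make_faulty_less(x, y):
    """Hypothetical helper: a 'less than' that lies only about the pair {x, y}."""
    def less(a, b):
        if {a, b} == {x, y}:
            return not (a < b)
        return a < b
    return less


def modified_quicksort(items, less, rng=random):
    """Return (ordered_items, triangle), where triangle is a 3-tuple or None."""
    if len(items) <= 1:
        return list(items), None

    pivot_index = rng.randrange(len(items))
    pivot = items[pivot_index]
    rest = items[:pivot_index] + items[pivot_index + 1:]
    smaller = [x for x in rest if less(x, pivot)]
    greater = [x for x in rest if not less(x, pivot)]

    smaller, left_triangle = modified_quicksort(smaller, less, rng)
    greater, right_triangle = modified_quicksort(greater, less, rng)
    triangle = left_triangle or right_triangle

    # Check the pivot's neighbors s, p, g for the cycle s < p < g < s.
    if triangle is None and smaller and greater:
        s, g = smaller[-1], greater[0]
        if less(s, pivot) and less(pivot, g) and less(g, s):
            triangle = (s, pivot, g)

    return smaller + [pivot] + greater, triangle
```

When the flipped pair is at least two positions apart in the true order, a run either never compares the pair (and the output is fully sorted) or reports a triangle containing both items.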

I think I've thought up a solution.
First, do a first pass with any decent sorting algorithm you want (like quicksort), which should, at worst, result in only one item that's significantly far from where it should be.
Then, choose a width h that's at least 5.
for i from 0 to n-h, we look at the group of h items at i, i+1, ..., i+h-1. We make all h*(h-1)/2 pairwise comparisons in that group, and rearrange them by who won the most comparisons. We then increment i and move onto the next group.
Afterwards, we do the same thing, but going backwards from i=n-h to i=0.
These two extra passes will bubble the displaced item up or down into the correct area, and use the extra comparisons within a group of h to override the single faulty comparison.
The final number of comparisons will be O(n*log(n)) + n*h*(h-1)/2. Not sure how much better you can do.
This method also works (I think) for more than one faulty comparison. All you need to do is make sure that h is large enough to override those faulty comparisons.
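The two sliding-window passes could look roughly like this in Python. It is a sketch of the idea only, with no proof of the displacement bound. `window_fixup` and `make_faulty_less` are hypothetical names, and the first-pass sort is assumed to have happened already.

```python
def make_faulty_less(x, y):
    """Hypothetical helper: a 'less than' that lies only about the pair {x, y}."""
    def less(a, b):
        if {a, b} == {x, y}:
            return not (a < b)
        return a < b
    return less


def window_fixup(items, less, h=5):
    """Run a forward and a backward pass of width-h round-robin reordering."""
    arr = list(items)
    n = len(arr)

    def reorder(i):
        window = arr[i:i + h]
        size = len(window)
        # wins[p] = how many other window members window[p] compares less than
        wins = [
            sum(1 for q in range(size) if q != p and less(window[p], window[q]))
            for p in range(size)
        ]
        order = sorted(range(size), key=lambda p: -wins[p])  # stable on ties
        arr[i:i + h] = [window[p] for p in order]

    for i in range(0, n - h + 1):        # forward pass: i = 0 .. n-h
        reorder(i)
    for i in range(n - h, -1, -1):       # backward pass: i = n-h .. 0
        reorder(i)
    return arr
```

For instance, starting from [9, 0, 1, ..., 8, 10, 11] with the pair (0, 9) flipped, the forward pass carries 9 to the right one window at a time until it sits between 8 and 10.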


How many paths of length n with the same start and end point can be found on a hexagonal grid?

Given this question, what about the special case when the start point and end point are the same?
Another change in my case is that we must move at every step. How many such paths can be found and what would be the most efficient approach? I guess this would be a random walk of some sort?
My thinking so far is, since we must always return to our starting point, thinking about n/2 might be easier. At every step, except at step n/2, we have 6 choices. At n/2 we have a different number of choices depending on whether n is even or odd. We also have a different number of choices depending on where we are (what previous choices we made). For example, if n is even and we went straight out, we only have one choice at n/2: going back. But if n is even and we didn't go straight out, we have more choices.
It is all the cases at this turning point that I have trouble getting straight.
Am I on the right track?
To be clear, I just want to count the paths. So I guess we are looking for some conditioned permutation?
This version of the combinatorial problem looks like it actually has a short formula as an answer.
Nevertheless, the general version, both this and the original question's, can be solved by dynamic programming in O (n^3) time and O (n^2) memory.
Consider a hexagonal grid which spans at least n steps in all directions from the target cell.
Introduce a coordinate system, so that every cell has coordinates of the form (x, y).
Let f (k, x, y) be the number of ways to arrive at cell (x, y) from the starting cell after making exactly k steps.
These can be computed either recursively or iteratively:
f (k, x, y) is just the sum of f (k-1, x', y') for the six neighboring cells (x', y').
The base case is f (0, xs, ys) = 1 for the starting cell (xs, ys), and f (0, x, y) = 0 for every other cell (x, y).
The answer for your particular problem is the value f (n, xs, ys).
The general structure of an iterative solution is as follows:
let f be an array [0..n] [-n-1..n+1] [-n-1..n+1] (all inclusive) of integers
f[0][*][*] = 0
f[0][xs][ys] = 1
for k = 1, 2, ..., n:
    for x = -n, ..., n:
        for y = -n, ..., n:
            f[k][x][y] =
                f[k-1][x-1][y] +
                f[k-1][x][y-1] +
                f[k-1][x+1][y] +
                f[k-1][x][y+1]
answer = f[n][xs][ys]
OK, I cheated here: the solution above is for a rectangular grid, where the cell (x, y) has four neighbors.
The six neighbors of a hexagon depend on how exactly we introduce a coordinate system.
I'd prefer other coordinate systems than the one in the original question.
This link gives an overview of the possibilities, and here is a short summary of that page on StackExchange, to protect against link rot.
My personal preference would be axial coordinates.
Note that, if we allow standing still instead of moving to one of the neighbors, that just adds one more term, f[k-1][x][y], to the formula.
The same goes for using triangular, rectangular, or hexagonal grid, for using 4 or 8 or some other subset of neighbors in a grid, and so on.
If you want to arrive to some other target cell (xt, yt), that is also covered: the answer is the value f[n][xt][yt].
Similarly, if you have multiple start or target cells, and you can start and finish at any of them, just alter the base case or sum the answers in the cells.
The general layout of the solution remains the same.
This obviously works in n * (2n+1) * (2n+1) * number-of-neighbors, which is O(n^3) for any constant number of neighbors (4 or 6 or 8...) a cell may have in our particular problem.
Finally, note that, at step k of the main loop, we need only two layers of the array f: f[k-1] is the source layer, and f[k] is the target layer.
So, instead of storing all layers for the whole time, we can store just two layers, as we don't need more: one for odd k and one for even k.
Using only two layers is as simple as changing all f[k] and f[k-1] to f[k%2] and f[(k-1)%2], respectively.
This lowers the memory requirement from O(n^3) down to O(n^2), as advertised in the beginning.
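Here is a minimal Python sketch of the two-layer dynamic programming for a hexagonal grid. It assumes axial coordinates, in which the six neighbors of (x, y) are the offsets listed in `HEX_STEPS`. The function name `count_hex_walks` is made up for this example, and the layers are stored as dictionaries of reachable cells instead of fixed-size arrays. Each cell pushes its count to its neighbors, which gives the same result as summing over neighbors because the neighbor relation is symmetric.

```python
from collections import defaultdict

# Axial-coordinate offsets of the six neighbors of a hexagonal cell.
HEX_STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def count_hex_walks(n, target=(0, 0), start=(0, 0)):
    """Number of n-step walks from start to target, moving at every step."""
    layer = {start: 1}                     # f[0]
    for _ in range(n):
        next_layer = defaultdict(int)      # f[k], built from f[k-1]
        for (x, y), ways in layer.items():
            for dx, dy in HEX_STEPS:
                next_layer[(x + dx, y + dy)] += ways
        layer = next_layer                 # only two layers are alive at once
    return layer.get(target, 0)
```

With the start and target equal, the first few values are 1, 0, 6, 12, 90 for n = 0, 1, 2, 3, 4.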
For a more mathematical solution, here are some steps that would perhaps lead to one.
First, consider the following problem: what is the number of ways to go from (xs, ys) to (xt, yt) in n steps, each step moving one square north, west, south, or east?
To arrive from x = xs to x = xt, we need H = |xt - xs| steps in the right direction (without loss of generality, let it be east).
Similarly, we need V = |yt - ys| steps in another right direction to get to the desired y coordinate (let it be south).
We are left with k = n - H - V "free" steps, which can be split arbitrarily into pairs of north-south steps and pairs of east-west steps.
Obviously, if k is odd or negative, the answer is zero.
So, for each possible split k = 2h + 2v of "free" steps into horizontal and vertical steps, what we have to do is construct a path of H+h steps east, h steps west, V+v steps south, and v steps north. These steps can be done in any order.
The number of such sequences is a multinomial coefficient, and is equal to n! / (H+h)! / h! / (V+v)! / v!.
To finally get the answer, just sum these over all possible h and v such that k = 2h + 2v.
This solution calculates the answer in O(n) if we precalculate the factorials, also in O(n), and consider all arithmetic operations to take O(1) time.
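The sum over splits for the rectangular grid can be written down directly. The sketch below uses a hypothetical function name, `count_square_grid_walks`, and takes the displacement (dx, dy) = (xt - xs, yt - ys) as input. It recomputes factorials with `math.factorial` instead of precalculating them, which is simpler but not the O(n) variant described above.

```python
from math import factorial


def count_square_grid_walks(n, dx, dy):
    """Count n-step walks with unit N/E/S/W steps and net displacement (dx, dy)."""
    H, V = abs(dx), abs(dy)
    free = n - H - V                 # steps left over after the forced ones
    if free < 0 or free % 2 == 1:
        return 0
    total = 0
    for h in range(free // 2 + 1):   # h east-west pairs, v north-south pairs
        v = free // 2 - h
        total += factorial(n) // (
            factorial(H + h) * factorial(h) * factorial(V + v) * factorial(v)
        )
    return total
```

For closed walks (dx = dy = 0) this gives 4 for n = 2 and 36 for n = 4.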
For a hexagonal grid, a complicating feature is that there is no such clear separation into horizontal and vertical steps.
Still, given the starting cell and the number of steps in each of the six directions, we can find the final cell, regardless of the order of these steps.
So, a solution can go as follows:
Enumerate all possible partitions of n into six summands a1, ..., a6.
For each such partition, find the final cell.
For each partition where the final cell is the cell we want, add multinomial coefficient n! / a1! / ... / a6! to the answer.
Just so, this takes O(n^6) time and O(1) memory.
By carefully studying the relations between different directions on a hexagonal grid, perhaps we can actually consider only the partitions which arrive at the target cell, and completely ignore all other partitions.
If so, this solution can be optimized into at least some O(n^3) or O(n^2) time, maybe further with decent algebraic skills.
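The unoptimized enumeration for the hexagonal grid might look like the following sketch. It assumes axial coordinates with the six directions (1,0), (-1,0), (0,1), (0,-1), (1,-1), (-1,1), and the name `count_hex_walks_by_partition` is invented for the example. It enumerates five of the six counts and derives the sixth from n, so it is only practical for small n.

```python
from itertools import product
from math import factorial


def count_hex_walks_by_partition(n, tx=0, ty=0):
    """Sum multinomial coefficients over step counts that end at (tx, ty)."""
    total = 0
    for a1, a2, a3, a4, a5 in product(range(n + 1), repeat=5):
        a6 = n - (a1 + a2 + a3 + a4 + a5)
        if a6 < 0:
            continue
        # Final cell depends only on how many steps go in each direction.
        x = a1 - a2 + a5 - a6
        y = a3 - a4 - a5 + a6
        if (x, y) != (tx, ty):
            continue
        denominator = 1
        for count in (a1, a2, a3, a4, a5, a6):
            denominator *= factorial(count)
        total += factorial(n) // denominator
    return total
```

For small n its output should agree with the dynamic programming solution, which is a convenient cross-check.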

Why is the greedy algorithm optimal?

Codility, lesson 14, task TieRopes (https://codility.com/demo/take-sample-test/tie_ropes). Stated briefly, the problem is to partition a list A of positive integers into the maximum number of (contiguous) sublists having sum at least K.
I've only come up with a greedy solution because that's the name of the lesson. It passes all the tests but I don't know why it is an optimal solution (if it is optimal at all).
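#include <vector>
using namespace std;
int solution(int K, vector<int> &A) {
    int sum = 0, count = 0;
    for (int a : A) {
        sum += a;          // extend the current sublist
        if (sum >= K) {    // long enough: close it and start a new one
            ++count;
            sum = 0;
        }
    }
    return count;
}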
Can somebody tell me if and why this solution is optimal?
Maybe I'm being naive or making some mistake here, but I think that is not too hard (although not obvious) to see that the algorithm is indeed optimal.
Suppose that you have an optimal partition of the list with the maximum number of sublists. You may or may not have used all of the elements of the list, but since adding an element to a valid sublist also produces a valid sublist, let's suppose that any "remaining" element that was initially not assigned to any sublist is assigned arbitrarily to one of its adjacent sublists; so we have a proper optimal partition of the list, which we will call P1.
Now let's think about the partition that the greedy algorithm would produce, say P2. There are two things that can happen for the first sublist in P2 (it cannot be longer than the first sublist in P1, since greedy closes a sublist as soon as its sum reaches K):
1. It can be the same as the first sublist in P1.
2. It can be shorter than the first sublist in P1.
In 1. you would repeat the reasoning starting in the next element after the first sublist. If every subsequent sublist produced by the algorithm is equal to that in P1, then P1 and P2 will be equal.
In 2. you would also repeat the reasoning, but now you have at least one "extra" item available. So, again, the next sublist may:
2.1. Get as far as the next sublist in P1.
2.2. End before the next sublist in P1.
And repeat. So, in every case, you will have at least as many sublists as P1. Which means, that P2 is at least as good as any possible partition of the list, and, in particular, any optimal partition.
It's not a very formal demonstration, but I think it's valid. Please point out anything you think may be wrong.
Here are the ideas that lead to a formal proof.
1. If A is a suffix of B, then the maximum partition size for A is less than or equal to the maximum partition size for B, because we can extend the first sublist of a partition of A to include the new elements without decreasing its sum.
2. Every proper prefix of every sublist in the greedy solution sums to less than K.
3. There is no point in having gaps, because we can add the missing elements to an adjacent list (I thought that my wording of the question had ruled out this possibility by definition, but I'll say it anyway).
The formal proof can be carried out by induction to show that, for every nonnegative integer i, there exists an optimal solution that agrees with the greedy solution on the first i sublists of each. It follows that, when i is sufficiently large, the only solution that agrees with greedy is greedy, so the greedy solution is optimal.
The basis i = 0 is trivial, since an arbitrary optimal solution will do. The inductive step consists of finding an optimal solution that agrees with greedy on the first i sublists and then shrinking the i+1th sublist to match the greedy solution (by observation 2, we really are shrinking that sublist, since it starts at the same position as greedy's; by observation 1, we can extend the i+2th sublist of the optimal solution correspondingly).
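As an informal sanity check of the claim (not a substitute for the proof), the greedy count can be compared against an exhaustive dynamic program on small inputs. Both function names below are hypothetical. The brute-force side allows unused elements, which by the observation about gaps does not change the optimum.

```python
from itertools import product


def greedy_count(K, A):
    """Close a sublist as soon as its running sum reaches K."""
    total = count = 0
    for a in A:
        total += a
        if total >= K:
            count += 1
            total = 0
    return count


def best_count(K, A):
    """best[i] = maximum number of disjoint sublists with sum >= K inside A[:i]."""
    n = len(A)
    best = [0] * (n + 1)
    for i in range(1, n + 1):
        best[i] = best[i - 1]            # leave A[i-1] unused
        running = 0
        for j in range(i - 1, -1, -1):   # try the sublist A[j:i] as the last one
            running += A[j]
            if running >= K:
                best[i] = max(best[i], best[j] + 1)
    return best[n]


def check_small_cases(max_len=6, max_value=3, max_K=7):
    """Return True if greedy matches the exhaustive optimum on all small inputs."""
    for length in range(max_len + 1):
        for A in product(range(1, max_value + 1), repeat=length):
            for K in range(1, max_K + 1):
                if greedy_count(K, A) != best_count(K, A):
                    return False
    return True
```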

Partition a set into k groups with minimum number of moves

You have a set of n objects for which integer positions are given. A group of objects is a set of objects at the same position (not necessarily all the objects at that position: there might be multiple groups at a single position). The objects can be moved to the left or right, and the goal is to move these objects so as to form k groups, and to do so with the minimum distance moved.
For example:
With initial positions at [4,4,7], and k = 3: the minimum cost is 0.
[4,4,7] and k = 2: minimum cost is 0
[1,2,5,7] and k = 2: minimum cost is 1 + 2 = 3
I've been trying to use a greedy approach (by calculating which move would be shortest) but that wouldn't work because every move involves two elements which could be moved either way. I haven't been able to formulate a dynamic programming approach as yet but I'm working on it.
This problem is a one-dimensional instance of the k-medians problem, which can be stated as follows. Given a set of points x_1...x_n, partition these points into k sets S_1...S_k and choose k locations y_1...y_k in a way that minimizes the sum over all x_i of |x_i - y_f(i)|, where y_f(i) is the location corresponding to the set to which x_i is assigned.
Due to the fact that the median is the population minimizer for absolute distance (i.e. L_1 norm), it follows that each location y_j will be the median of the elements x in the corresponding set S_j (hence the name k-medians). Since you are looking at integer values, there is the technicality that if S_j contains an even number of elements, the median might not be an integer, but in such cases choosing either the next integer above or below the median will give the same sum of absolute distances.
The standard heuristic for solving k-medians (and the related and more common k-means problem) is iterative, but this is not guaranteed to produce an optimal or even good solution. Solving the k-medians problem for general metric spaces is NP-hard, and finding efficient approximations for k-medians is an open research problem. Googling "k-medians approximation", for example, will lead to a bunch of papers giving approximation schemes.
In one dimension things become easier, and you can use a dynamic programming approach. A DP solution to the related one-dimensional k-means problem is described in this paper, and the source code in R is available here. See the paper for details, but the idea is essentially the same as what @SajalJain proposed, and can easily be adapted to solve the k-medians problem rather than k-means. For j<=k and m<=n let D(j,m) denote the cost of an optimal j-medians solution to x_1...x_m, where the x_i are assumed to be in sorted order. We have the recurrence
D(j,m) = min_q ( D(j-1,q) + Cost(x_{q+1},...,x_m) )
where q ranges from j-1 to m-1 and Cost is equal to the sum of absolute distances from the median. With a naive O(n) implementation of Cost, this would yield an O(n^3k) DP solution to the whole problem. However, this can be improved to O(n^2k) due to the fact that the Cost can be updated in constant time rather than computed from scratch every time, using the fact that, for a sorted sequence:
Cost(x_1,...,x_h) = Cost(x_2,...,x_h) + median(x_1...x_h)-x_1 if h is odd
Cost(x_1,...,x_h) = Cost(x_2,...,x_h) + median(x_2...x_h)-x_1 if h is even
See the writeup for more details. Except for the fact that the update of the Cost function is different, the implementation will be the same for k-medians as for k-means.
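A compact Python sketch of this recurrence is given below. It is an illustration under assumptions, not the implementation from the paper: the function name `k_medians_1d` is made up, k is assumed to be at most n, and Cost is evaluated in O(1) from prefix sums rather than with the incremental update described above. The overall running time is still O(n^2 k).

```python
def k_medians_1d(points, k):
    """Minimum total distance to gather the points into k groups (k <= n)."""
    xs = sorted(points)
    n = len(xs)
    prefix = [0]
    for x in xs:
        prefix.append(prefix[-1] + x)

    def cost(lo, hi):
        """Sum of |x - median| over xs[lo:hi] (hi exclusive, lo < hi)."""
        mid = (lo + hi) // 2
        median = xs[mid]
        left = median * (mid - lo) - (prefix[mid] - prefix[lo])
        right = (prefix[hi] - prefix[mid]) - median * (hi - mid)
        return left + right

    INF = float("inf")
    # D[j][m] = optimal cost of splitting the first m sorted points into j groups
    D = [[INF] * (n + 1) for _ in range(k + 1)]
    D[0][0] = 0
    for j in range(1, k + 1):
        for m in range(j, n + 1):
            D[j][m] = min(D[j - 1][q] + cost(q, m) for q in range(j - 1, m))
    return D[k][n]
```

On the examples from the question, [4,4,7] gives 0 for both k = 3 and k = 2, and [1,2,5,7] with k = 2 gives 3.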
As I understand it, the problem is:
we have n points on a line.
we want to place k positions on the line. I call them destinations.
move each of the n points to one of the k destinations so that the sum of distances is minimum. I call this sum the total cost.
destinations can overlap.
An obvious fact is that for each point we should look at the nearest destination on its left and the nearest destination on its right, and choose the nearer of the two.
Another important fact is that all destinations can be placed on points, because we can move a destination along the line to the right or to the left until it reaches a point without increasing the total distance.
By these facts consider following DP solution:
DP[i][j] means the minimum total cost needed for the first i points, when we can use only j destinations and have to put a destination on the i-th point.
To calculate DP[i][j], fix the destination before the i-th point (we have i choices), and for each choice (for example the k-th point) calculate the distance needed for the points between the k-th point and the i-th point. Add this to DP[k][j - 1] and take the minimum over all k.
The calculation of the initial states (e.g. j = 1) and the final answer is left as an exercise!
Task 0 - sort the position of the objects in non-decreasing order
Let us define 'center' as the position of the object where it is shifted to.
Now we have two observations;
For N positions the 'center' would be a median of these N positions (the median, not the mean, is what minimizes the sum of absolute distances). Example, let 1,3,6,10 be the positions. Any position from 3 to 6 is a median, so we can take 6 as the center, at a cost of 5 + 3 + 0 + 4 = 12. This gives us the position with minimum cost of moving when all elements need to be grouped into 1 group.
Let N positions be grouped into K groups "optimally". When N+1 th object is added, then it will disturb only the K th group, i.e, first K-1 groups will remain unchanged.
From these observations, we build a dynamic programming approach.
Let Cost[i][k] and Center[i][k] be two 2D arrays.
Cost[i][k] = minimum cost when first 'i' objects are partitioned into 'k' groups
Center[i][k] stores the center of the 'i-th' object when Cost[i][k] is computed.
Let {L} be the elements from i-L,i-L+1,..i-1 which have the same center.
(Center[i-L][k] = Center[i-L+1][k] = ... = Center[i-1][k]) These are the only objects that need to be considered in the computation for i-th element (from observation 2)
Cost[i][k] will be
min(Cost[i-1][k-1] , Cost[i-L-1][k-1] + computecost(i-L, i-L+1, ... ,i))
Update Center[i-L ... i][k]
computecost() can be found trivially by finding the center (from observation 1)
Time Complexity:
Sorting O(NlogN)
Total Cost Computation Matrix = Total elements * Computecost = O(NK * N)
Total = O(NlogN + N*NK) = O(N*NK)
Let's look at k=1.
For k=1 and n odd, all points should move to the center point. For k=1 and n even, all points should move to either of the center points or any spot between them. By 'center' I mean in terms of number of points to either side, i.e. the median.
You can see this because if you select a target spot, x, with more points to its right than its left, then a new target 1 to the right of x would result in a cost reduction (unless there is exactly one more point to the right than the left and the target spot is a point, in which case n is even and the target is on/between the two center points).
If your points are already sorted, this is an O(1) operation. If not, I believe it's O(n) (via an order statistic algorithm).
Once you've found the spot that all points are moving to, it's O(n) to find the cost.
Thus regardless of whether the points are sorted or not, this is O(n).

Divide and conquer on sorted input with Haskell

For a part of a divide and conquer algorithm, I have the following question where the data structure is not fixed, so set is not to be taken literally:
Given a set X sorted wrt. some ordering of elements and subsets A and B together consisting of all elements in X, can sorted versions A' and B' of A and B be constructed in time linear in the number of elements in X ?
At the moment I am doing a standard sort at each recursive step giving the recursion
T(n) = 2*T(n/2) + O(n*log n)
for the complexity rather than
T(n) = 2*T(n/2) + O(n)
like in the procedural version, where one can utilize a structure with constant-time lookup on A and B to form A' and B' in linear time.
The added log n factor carries over to the overall complexity, giving O(n* (log n)^2) instead of O(n* log n).
Perhaps I am understanding the term lookup incorrectly. The creation of A' and B' in linear time is easy to do if membership of A and B can be checked in constant time.
I didn't succeed in my attempt at making things clearer by abstracting
away the specifics, so here is the actual problem:
I am implementing the algorithm for the closest pair problem. Given a
finite collection P of points in the plane it finds a pair of points
in P with the minimal distance. It works roughly as follows:
If P has at least 4 points, form Px and Py, the points in P sorted by
x- and y-coordinate. By splitting Px, form L and R, the left- and
right-most halves of the points. Recursively compute the closest pair
distance in L and R, and let d be the minimum of the two. Now the
minimum distance in P is either d or the distance from a point in L to
a point in R. If the minimal distance is between points from separate
halves, it will appear between a pair of points lying in the strip of
width 2*d centered around the line x = x0, where x0 is the x-coordinate
of a right-most point in L. It turns out that to find a potential
minimal distance pair in the strip, it is enough to compute for every
point in the strip its distance to the seven following points, provided
the strip points are in a collection sorted by y-coordinate.
What I don't see is how, in Haskell, to take advantage of having sorted P once at the beginning of the recursion in two steps: forming the sorted collections to pass into the recursive calls, and sorting the strip points by y-coordinate.
The following function may interest you:
partition :: (a -> Bool) -> [a] -> ([a], [a])
partition f xs = (filter f xs, filter (not . f) xs)
If you can compute set-membership in constant time, that is, there is a predicate of type a -> Bool that runs in constant time, then partition will run in time linear in the length of its input list. Furthermore, partition is stable, so that if its input list is sorted, then so are both output lists.
I would also like to point out that the above definition is meant to give the semantics of partition only; the real implementation in GHC only walks its input list once, even if the entire output is forced.
Of course, the real crux of the question is providing a constant-time predicate. The way you phrased the question leaves sets A and B quite unstructured -- you demand that we can handle any particular partitioning. In that case, I don't know of any particularly Haskell-y way of doing constant-time lookup in arbitrary sets. However, often these problems are a bit more structured: often, rather than set-membership, you are actually interested in whether some easily-computable property holds or not. In this case, the above is just what the doctor ordered.
I know very very little about Haskell but here's a shot anyway.
Given that (A+B) == X, can't you just iterate through X (in sorted order) and add each element to A' or B' depending on whether it exists in A or B? Given constant-time lookup of an element x in the sets A and B, the whole pass would be linear.
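The single pass described above can be sketched as follows. This is only an illustration in Python, not the asker's Haskell code: the names split_sorted and in_a are hypothetical, and it assumes membership in A can be tested in (expected) constant time, here with a hash set.

```python
def split_sorted(xs, in_a):
    """Split an already sorted list into two sorted lists in one pass.

    xs   -- the items of X, in sorted order
    in_a -- predicate telling whether an item belongs to A (assumed O(1))
    Items that fail the predicate go to B. Relative order is preserved,
    so both outputs are still sorted.
    """
    a_sorted, b_sorted = [], []
    for x in xs:
        (a_sorted if in_a(x) else b_sorted).append(x)
    return a_sorted, b_sorted


# Arbitrary subsets: a hash set gives expected constant-time membership.
X = [1, 3, 4, 7, 9, 12]
A = {9, 1, 7}
a_sorted, b_sorted = split_sorted(X, A.__contains__)

# Closest-pair flavour: points sorted by y, split by the side of x0 they lie on.
py = [(5, 0), (1, 1), (4, 2), (2, 3)]
x0 = 2
left_y, right_y = split_sorted(py, lambda p: p[0] <= x0)
```

If several points share the x-coordinate x0, a predicate on x alone may not reproduce the exact halves L and R; in that case a set of the points in L (or a tag stored with each point) would be needed instead.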

Revisit: 2D Array Sorted Along X and Y Axis

So, this is a common interview question. There's already a topic up, which I have read, but it's dead, and no answer was ever accepted. On top of that, my interests lie in a slightly more constrained form of the question, with a couple practical applications.
Given a two dimensional array such that:
Elements are unique.
Elements are sorted along the x-axis and the y-axis.
Neither sort predominates, so neither sort is a secondary sorting parameter.
As a result, the diagonal is also sorted.
All of the sorts can be thought of as moving in the same direction. That is to say that they are all ascending, or that they are all descending.
Technically, I think as long as you have a >/=/< comparator, any total ordering should work.
Elements are numeric types, with a single-cycle comparator.
Thus, memory operations are the dominating factor in a big-O analysis.
How do you find an element? Only worst case analysis matters.
Solutions I am aware of:
A variety of approaches that are:
O(nlog(n)), where you approach each row separately.
O(nlog(n)) with strong best and average performance.
One that is O(n+m):
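Start in a non-extreme corner, which we will assume is the bottom right (with values increasing to the right and upward, this corner holds neither the minimum nor the maximum).
Let the target be J, and let M be the value at the current position.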
If M is greater than J, move left.
If M is less than J, move up.
If you can do neither, you are done, and J is not present.
If M is equal to J, you are done.
Originally found elsewhere, most recently stolen from here.
And I believe I've seen one with a worst-case O(n+m) but an optimal case of nearly O(log(n)).
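A minimal sketch of the O(n+m) corner walk listed above, in Python. It assumes a row-major layout (a list of rows, each row ascending left to right and each column ascending top to bottom), so the non-extreme corner used here is the top right, and "move up" in the x/y description becomes "move to the next row". The name staircase_search is hypothetical.

```python
def staircase_search(grid, target):
    """Return (row, col) of target, or None if it is absent.

    grid is a list of rows; rows ascend left to right and columns
    ascend top to bottom (an assumption about the layout).
    """
    if not grid or not grid[0]:
        return None
    row, col = 0, len(grid[0]) - 1       # top right: neither min nor max
    while row < len(grid) and col >= 0:
        current = grid[row][col]
        if current == target:
            return (row, col)
        if current > target:
            col -= 1                     # rest of this column is even larger
        else:
            row += 1                     # rest of this row is even smaller
    return None                          # walked off the grid: not present


grid = [
    [1, 3, 5],
    [2, 6, 9],
    [4, 8, 10],
]
found = staircase_search(grid, 8)
missing = staircase_search(grid, 7)
```

Each iteration discards one row or one column, so the loop runs at most n+m times.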
What I am curious about:
Right now, I have proved to my satisfaction that a naive partitioning attack always devolves to nlog(n). Partitioning attacks in general appear to have an optimal worst case of O(n+m), and most do not terminate early in cases of absence. I was also wondering, as a result, if an interpolation probe might not be better than a binary probe, and thus it occurred to me that one might think of this as a set intersection problem with a weak interaction between sets. My mind cast immediately towards Baeza-Yates intersection, but I haven't had time to draft an adaptation of that approach. However, given my suspicion that optimality of an O(n+m) worst case is provable, I thought I'd just go ahead and ask here, to see if anyone could bash together a counter-argument, or pull together a recurrence relation for interpolation search.
Here's a proof that it has to be at least Omega(min(n,m)). Let n >= m. Then consider the matrix which has all 0s at (i,j) where i+j < m, all 2s where i+j >= m, except for a single (i,j) with i+j = m which has a 1. This is a valid input matrix, and there are m possible placements for the 1. No query into the array (other than the actual location of the 1) can distinguish among those m possible placements. So you'll have to check all m locations in the worst case, and at least m/2 expected locations for any randomized algorithm.
One of your assumptions was that matrix elements have to be unique, and I didn't do that. It is easy to fix, however, because you just pick a big number X=n*m, replace all 0s with unique numbers less than X, all 2s with unique numbers greater than X, and 1 with X.
And because it is also Omega(lg n) (counting argument), it is Omega(m + lg n) where n>=m.
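To make the lower-bound construction concrete, here is a small Python sketch that builds the 0/1/2 matrix and checks two things: that it is sorted along both axes, and that two different placements of the 1 disagree only in the two cells involved. The helper names are hypothetical and indices are 0-based, which is an assumption about the argument above.

```python
def adversarial_matrix(rows, cols, one_at):
    """Build the 0/1/2 matrix from the lower-bound argument.

    Cells with i + j < m are 0 and all others are 2, except the single
    cell `one_at` on the anti-diagonal i + j == m, which holds the 1.
    Here m = min(rows, cols).
    """
    m = min(rows, cols)
    i1, j1 = one_at
    assert i1 + j1 == m, "the 1 must sit on the anti-diagonal i + j == m"
    grid = [[0 if i + j < m else 2 for j in range(cols)] for i in range(rows)]
    grid[i1][j1] = 1
    return grid


def is_sorted_both_ways(grid):
    """True if every row and every column is non-decreasing."""
    rows_ok = all(r[j] <= r[j + 1] for r in grid for j in range(len(r) - 1))
    cols_ok = all(grid[i][j] <= grid[i + 1][j]
                  for i in range(len(grid) - 1) for j in range(len(grid[0])))
    return rows_ok and cols_ok


def differing_cells(g1, g2):
    """Cells on which two equally sized grids disagree."""
    return [(i, j) for i, row in enumerate(g1)
            for j, v in enumerate(row) if v != g2[i][j]]


g1 = adversarial_matrix(4, 6, (1, 3))
g2 = adversarial_matrix(4, 6, (3, 1))
both_valid = is_sorted_both_ways(g1) and is_sorted_both_ways(g2)
diff = differing_cells(g1, g2)
```

Since the two matrices differ only at the two candidate positions, a probe anywhere else returns the same value for both, which is why about m probes are unavoidable in the worst case.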
An optimal O(m+n) solution is to start at the top-left corner, which holds the minimal value. Move diagonally downwards to the right until you hit an element whose value is >= that of the given element. If the element's value is equal to that of the given element, return found as true.
Otherwise, from here we can proceed in two ways.
Strategy 1:
Move up in the column and search for the given element until we reach the end. If found, return found as true
Move left in the row and search for the given element until we reach the end. If found, return found as true
return found as false
Strategy 2:
Let i denote the row index and j denote the column index of the diagonal element we have stopped at. (Here, we have i = j, BTW). Let k = 1.
Repeat the steps below while i-k >= 0:
Check whether a[i-k][j] is equal to the given element. If yes, return found as true.
Check whether a[i][j-k] is equal to the given element. If yes, return found as true.
Increment k.
1 2 4 5 6
2 3 5 7 8
4 6 8 9 10
5 8 9 10 11
