The minimum number of "insertions" to sort an array - algorithm

Suppose there is an unordered list. The only operation we can do is to move an element and insert it back to any place. How many moves does it take to sort the whole list?
I guess the answer is size of the list - size of longest ordered sequence, but I have no idea how to prove it.

First note that moving an element doesn't change relative order of elements other than the one being moved.
Consider the longest non-decreasing subsequence (closely related to the longest increasing subsequence; the ways to find them are similar).
By only moving the element not in this sequence, it's easy to see that we'd end up with a sorted list, since all the elements in this sequence are already sorted relative to each other.
If we don't move any elements in this sequence, any other element between two elements in this subsequence is guaranteed to be either greater than the larger element, or smaller than the smaller one (if this is not true, it itself would be in the longest sequence), so it needs to be moved.
(see below for example)
Does it need to be non-decreasing? Yes. Consider what happens if two consecutive elements in this sequence are decreasing. In this case it would be impossible to sort the list without moving at least one of those two elements.
To minimize the number of moves required, it's sufficient to pick the longest sequence possible, as done above.
So the total number of moves required is the size of the list minus the size of the longest non-decreasing subsequence.
An example explaining the value of an element not in the non-decreasing subsequence mentioned above:
Consider the longest non-decreasing subsequence 1 x x 2 x x 2 x 4 (the x's are some elements not part of the sequence).
Now consider the possible values for an x between 2 and 4.
If it's 2, 3 or 4, the longest subsequence would include that element. If it's greater than 4 or smaller than 2, it needs to be moved.
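The formula above (size of the list minus the size of the longest non-decreasing subsequence) can be computed in O(n log n). The following is a minimal sketch, not the answerer's own code; the function name `min_moves_to_sort` is hypothetical, and it assumes the elements are mutually comparable.

```python
from bisect import bisect_right


def min_moves_to_sort(items):
    """Return len(items) minus the length of a longest non-decreasing subsequence."""
    # tails[k] is the smallest possible last value of a non-decreasing
    # subsequence of length k + 1 found so far.
    tails = []
    for x in items:
        # bisect_right lets an equal value extend a subsequence,
        # which is what makes the result non-decreasing rather than increasing.
        pos = bisect_right(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(items) - len(tails)
```

For the list 3, 1, 2 this returns 1, matching the single move of the element 3.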

It is easy to prove, e.g. by mathematical induction, that size of the list - size of longest ordered sequence moves are always enough to sort any sequence.
You can also easily prove that for some unordered sequences this is the best you can do, simply by exhibiting such a trivial sequence. E.g. to sort the sequence 3, 1, 2 you need one move of one item (3), and it's trivial to see that it cannot be done in fewer moves; size of the list - size of longest ordered sequence is indeed equal to 1.
However, proving that it is the best for all sequences is impossible because that statement is not actually true: A counter example is a sequence with multiple sorted sub-sequences S[i], where max(S[i]) < min(S[i+1]) for every i. For example see the sequence 1, 2, 3, 1000, 4, 5, 6.


Minimum operations to make K Non Decreasing Array [duplicate]

We know about an algorithm that will find the Longest Increasing subsequence in O(nlogn). I was wondering whether we can find the Longest non-decreasing subsequence with similar time complexity?
For example, consider an array : (4,10,4,8,9).
The longest increasing subsequence is (4,8,9).
And a longest non-decreasing subsequence would be (4,4,8,9).
First, here’s a “black box” approach that will let you find the longest nondecreasing subsequence using an off-the-shelf solver for longest increasing subsequences. Let’s take your sample array:
4, 10, 4, 8, 9
Now, imagine we transformed this array as follows by adding a tiny fraction to each number:
4.0, 10.1, 4.2, 8.3, 9.4
Changing the numbers this way will not change the results of any comparisons between two different integers, since the integer components have a larger magnitude difference than the values after the decimal point. However, if you compare the two 4s now, the latter 4 compares bigger than the previous one. If you now find the longest increasing subsequence, you get back [4.0, 4.2, 8.3, 9.4], which you can then map back to [4, 4, 8, 9].
More generally, if you’re working with an array of n integer values, you can add i / n to each of the numbers, where i is its index, and you’ll be left with a sequence of distinct numbers. From there running a regular LIS algorithm will do the trick.
If you can’t work with fractions this way, you could alternatively multiply each number by n and then add in i, which also works.
On the other hand, suppose you have the code for a solver for LIS and want to convert it to one that solves the longest nondecreasing subsequence problem. The reasoning above shows that if you treat later copies of numbers as being “larger” than earlier copies, then you can just use a regular LIS. Given that, just read over the code for LIS and find spots where comparisons are made. When a comparison is made between two equal values, break the tie by considering the later appearance to be bigger than the earlier one.
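The multiply-by-n-and-add-i variant described above can be sketched as follows. This is an illustration under assumptions: the input holds integers, and the names `lis_length` and `longest_nondecreasing_length` are hypothetical. `lis_length` stands in for whatever off-the-shelf strict LIS solver you already have.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of a longest strictly increasing subsequence, in O(n log n)."""
    tails = []
    for x in seq:
        # bisect_left replaces an equal value instead of extending past it,
        # so the subsequence stays strictly increasing.
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)


def longest_nondecreasing_length(values):
    """Break ties by position, then reuse the strict LIS solver unchanged."""
    n = len(values)
    # value * n + index keeps the order of distinct integers and makes
    # a later copy of a value compare larger than an earlier copy.
    keyed = [v * n + i for i, v in enumerate(values)]
    return lis_length(keyed)
```

On the sample array (4, 10, 4, 8, 9) the keyed values are 20, 51, 22, 43, 49, so the strict solver reports 4 where it would report 3 on the untransformed input.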
I think the following will work in O(nlogn):
Scan the array from right to left, and for each element solve a subproblem of finding a longest subsequence starting from the given element of the array. E.g. if your array has indices from 0 to 4, then you start with the subarray [4,4] and check what's the longest sequence starting from 4, then you check subarray [3,4] and what's the longest subsequence starting from 3, next [2,4], and so on, until [0,4]. Finally, you choose the longest subsequence established in either of the steps.
For the last element (so subarray [4,4]) the longest sequence is always of length 1.
When in the next iteration you consider another element to the left (e.g., in the second step you consider the subarray [3,4], so the new element is the element with index 3 in the original array), you check whether that element is not greater than some of the elements to its right. If so, you take the best result among those elements (the ones that are not smaller than the new element) and add one; otherwise the longest sequence starting at the new element has length 1.
For instance:
[4,4] -> longest sequence of length 1 (9)
[3,4] -> longest sequence of length 2 (8,9) 1+1 (you take the longest sequence from above which starts with 9 and add one to its length)
[2,4] -> longest sequence of length 3 (4,8,9) 2+1 (you take the longest sequence from above, i.e. (8,9), and add one to its length)
[1,4] -> longest sequence of length 1 (10) nothing to add to (10 is greater than all the elements to its right)
[0,4] -> longest sequence of length 4 (4,4,8,9) 3+1 (you take the longest sequence above, i.e. (4,8,9), and add one to its length)
The main issue is how to browse all the candidates to the right in logarithmic time. For that you keep a sorted map (a balanced binary tree). The keys are the already visited elements of the array. The values are the longest sequence lengths obtainable from that element. No need to store duplicates - among duplicate keys store the entry with largest value.
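A sketch of the right-to-left scan described above. Python's standard library has no balanced binary tree, so this version swaps in a Fenwick tree over value ranks to answer "best length among already visited elements that are not smaller than x" in logarithmic time. That substitution and the function name `lnds_length_right_to_left` are assumptions of this example, not part of the answer.

```python
def lnds_length_right_to_left(values):
    """Longest non-decreasing subsequence length via a right-to-left scan."""
    # Rank values so that the largest value gets rank 1. "Elements >= x"
    # then form a prefix of ranks, which the tree can query for a maximum.
    order = sorted(set(values), reverse=True)
    rank = {v: i + 1 for i, v in enumerate(order)}
    tree = [0] * (len(order) + 1)  # tree[i] holds a maximum over a block of ranks

    def update(i, length):
        while i < len(tree):
            tree[i] = max(tree[i], length)
            i += i & -i

    def prefix_max(i):
        best = 0
        while i > 0:
            best = max(best, tree[i])
            i -= i & -i
        return best

    best_overall = 0
    for x in reversed(values):
        # Best sequence starting at an element to the right that is >= x,
        # extended by x itself.
        length = prefix_max(rank[x]) + 1
        update(rank[x], length)
        best_overall = max(best_overall, length)
    return best_overall
```

For (4, 10, 4, 8, 9) the per-element lengths come out as 1, 2, 3, 1, 4 from right to left, the same values as in the walkthrough above.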

Minimum amount of swaps needed so no number has two neighbours that are both greater..?

The problem statement goes like this: Given a list of N < 500 000 distinct numbers, find the minimum number of swaps of adjacent elements required such that no number has two neighbours that are both greater. A number can only be swapped with a neighbour.
Hint given: Use a segment tree or a fenwick tree.
I don't really get the idea of how I should use a sum-tree to solve this problem.
Example inputs:
Input 1:
5 (amount of elements in the list)
3 1 4 2 0
output 1: 1
input 2:
6
4 5 2 0 1 3
output 2: 4
I can do it in O(n log n) time and O(n) extra space. But first let's look at the quadratic solution I've hinted at earlier:
initialize the result accumulator to 0
while the input list has more than two elements:
    find the lowest element in the list
    add its distance from the closer end of the list to the accumulator
    remove the element from the list
output the accumulator
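The steps above translate almost line for line into Python. This is a minimal sketch assuming distinct values, as the problem statement guarantees; the function name `min_adjacent_swaps_quadratic` is hypothetical.

```python
def min_adjacent_swaps_quadratic(values):
    """Greedy from the steps above; assumes distinct values. Runs in O(n^2)."""
    remaining = list(values)
    total = 0
    while len(remaining) > 2:
        idx = remaining.index(min(remaining))
        # Distance from the lowest element to the closer end of the current list.
        total += min(idx, len(remaining) - 1 - idx)
        del remaining[idx]
    return total
```

It returns 1 for `3 1 4 2 0` and 4 for `4 5 2 0 1 3`, matching the example outputs in the question.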
Why does this work? First, let's look at what a sequence that requires zero swaps looks like. Since there are no duplicates, if the lowest element is anywhere but at either end, it is surrounded by two elements that are both greater, violating the requirement; thus the lowest element must be at one of the ends. The same argument applies recursively to the subsequence that excludes this element. To bring a sequence into this state, at least as many swaps involving the lowest element as in the greedy algorithm are required to move the lowest element to one end, and since swaps involving the lowest element do not change the relative ordering of the rest, there is no penalty for doing those swaps first.
Unfortunately, just implementing this with a list is quadratic. How do you make it faster? Have a finger tree that tracks the subtree weight and minimum value of each subtree and update these as you remove individual minima:
To initialize the tree: First, think of each element in the list as a one-element sublist with its minimum equal to its value. Then, while you have more than one sublist, group the subsequences in pairs, building a tree of subsequences. The length of a sequence is the sum of lengths of both its halves, and its minimum is equal to whichever minimum from both halves is lower.
To remove the minimum from a subsequence while tracking its index in the sequence:
    Decrease the length of the subsequence.
    Remove the minimum from whichever half's minimum is equal to this subsequence's minimum.
    The new minimum is the lower of its halves' new minima.
    The index of the minimum is equal to its index in its respective half, plus the length of the left half if the minimum was in the right half.
The distance from one end is then equal to either the index or (length before removal - index - 1), whichever is lower.
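The same greedy can be sketched in O(n log n). Instead of the min-tracking tree described above, this version uses a Fenwick tree (as the question's hint suggests) that counts the elements still present, and visits elements in increasing order of value. That substitution and the function name `min_adjacent_swaps` are assumptions of this example; distinct values are assumed as in the problem statement.

```python
def min_adjacent_swaps(values):
    """Greedy removal of minima in O(n log n); assumes distinct values."""
    n = len(values)
    tree = [0] * (n + 1)  # Fenwick tree counting elements still present

    def add(pos, delta):
        i = pos + 1
        while i <= n:
            tree[i] += delta
            i += i & -i

    def count_before(pos):
        """Number of still-present elements at positions < pos."""
        total, i = 0, pos
        while i > 0:
            total += tree[i]
            i -= i & -i
        return total

    for pos in range(n):
        add(pos, 1)

    swaps = 0
    remaining = n
    # Visit positions from the smallest value to the largest.
    for pos in sorted(range(n), key=lambda p: values[p]):
        left = count_before(pos)
        right = remaining - left - 1
        swaps += min(left, right)  # distance to the closer end
        add(pos, -1)
        remaining -= 1
    return swaps
```

The last two elements always contribute 0, so there is no need to stop the loop early as the quadratic version does.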

Find the minimum two non edge, non adjacent entries in an array

I had the following question
Find the smallest two nonadjacent values in an array, such that none of these elements is on the array edge (no A[0] and no A[n-1])
The runtime of the algorithm should be O(n)
I first thought about sorting the array, but sorting would cost O(nlogn)
Ignoring this fact for a second, if we sort the array, we cannot just take the first two values, since they might violate the conditions mentioned above. And then what? Take the next element and try, and if that fails take the next one? I can't see an easy solution there.
Another solution is to generate all allowed pairs and find the pair with the minimum sum. But finding all pairs cost O(n^2)
Any ideas?
In linear time, find the ith smallest entry (excluding the first and last) for i from 1 to 4. The best possibility is a pair of these. If 1 and 2 are nonadjacent, then that's the best. Otherwise, if 1 and 3 are nonadjacent, then that's the best. Otherwise, 2 and 3 are bordering 1 (hence not each other), and the possibilities are 1 and 4, or 2 and 3.
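A sketch of the four-smallest idea above. It assumes "smallest two" means the pair with the minimal sum, and rather than spelling out the case analysis it simply tries every pair among the four smallest interior entries, which covers the same cases. The function name `smallest_nonadjacent_pair` is hypothetical.

```python
from heapq import nsmallest
from itertools import combinations


def smallest_nonadjacent_pair(a):
    """Return indices (i, j), i < j, of the interior non-adjacent pair with minimal sum."""
    interior = range(1, len(a) - 1)
    # With a fixed count of 4, nsmallest runs in O(n) time.
    candidates = nsmallest(4, interior, key=lambda i: a[i])
    pairs = [(i, j) for i, j in combinations(candidates, 2) if abs(i - j) > 1]
    if not pairs:
        raise ValueError("need at least 5 elements")
    i, j = min(pairs, key=lambda p: a[p[0]] + a[p[1]])
    return (min(i, j), max(i, j))
```

For `[9, 1, 0, 2, 7, 8, 9]` the smallest interior value (0 at index 2) is adjacent to both the second and third smallest, so the result is the pair at indices 1 and 3.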
You could go with your sort first, then, unless I am missing something, take elements 0 and 2. Those would be the smallest non-adjacent values.
As long as the array is 3 elements or greater you should be assured that the element values in position 0 and 2 are the smallest (and if you need them to be, non-consecutive) as well as non-adjacent in the array.
If your array is sorted, you would only have to keep comparing alternate elements (like indices (0,2), (1,3), (2,4) and so on) and then find the pair with the smallest difference. But without sorting, you are right in saying that the run time complexity would then become O(n^2), as you would have to compare every element with every other element in the array.

Check for duplicate subsequences of length >= N in sequence

I have a sequence of values and I want to know if it contains any repeated subsequences of a certain minimum length. For instance:
1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101
Contains the subsequence 3, 4, 5, 100 twice. It also contains the subsequence 99, 101 twice, but that subsequence is too short to care about.
Is there an efficient algorithm for checking the existence of such a subsequence? I'm not especially interested in locating the sequences (though that would be helpful for verification), I'm primarily just interested in a True/False answer, given a sequence and a minimum subsequence length.
My only approach so far is to brute force search it: for each item in the sequence, find all the other locations where the item occurs (already at O(N^2)), and then walk forward one step at a time from each location and see if the next item matches, and keep going until I find a mismatch or find a matching subsequence of sufficient length.
Another thought I had but haven't been able to develop into an actual approach is to build a tree of all the sequences, so that each number is a node and a child of the number that preceded it, wherever that node happens to already be in the tree.
There are O(k) solutions (k - the length of the whole sequence) for any value of N.
Solution #1: Build a suffix tree for the input sequence (using Ukkonen's algorithm). Iterate over the nodes with two or more children and check if at least one of them has depth >= N.
Solution #2: Build a suffix automaton for the input sequence. Iterate over all the states whose right context contains at least two different strings and check if at least one of those states has distance >= N from the initial state of the automaton.
Solution #3: A suffix array and the longest common prefix technique can also be used (build the suffix array for the input sequence, compute the longest common prefix array, check that there is a pair of adjacent suffixes with a common prefix of length at least N).
These solutions have O(k) time complexity under the assumption that the alphabet size is constant (the alphabet consists of all elements of the input sequence).
If that is not the case, it is still possible to obtain O(k log k) worst case time complexity (by storing all transitions in a tree or in an automaton in a map) or O(k) on average using a hashmap.
P.S. I use the terms string and sequence interchangeably here.
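To illustrate Solution #3, here is a deliberately naive sketch: it builds the suffix array by sorting suffixes directly, which is simple but far from the O(k) constructions mentioned above, and compares only adjacent suffixes. The function name `has_repeat` is hypothetical, `min_len` is assumed to be at least 1, and two occurrences are allowed to overlap.

```python
def has_repeat(seq, min_len):
    """True if some contiguous run of at least min_len items occurs twice."""
    seq = list(seq)
    # Naive suffix array: sort start positions by the suffix they begin.
    # Simple to read, but not linear time.
    starts = sorted(range(len(seq)), key=lambda i: seq[i:])
    for a, b in zip(starts, starts[1:]):
        # Grow the common prefix of two neighbouring suffixes item by item.
        lcp = 0
        while (a + lcp < len(seq) and b + lcp < len(seq)
               and seq[a + lcp] == seq[b + lcp]):
            lcp += 1
            if lcp >= min_len:
                return True
    return False
```

On the sequence from the question this returns True for a minimum length of 4 (the run 3, 4, 5, 100) and False for a minimum length of 5.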
If you only care about subsequences of length exactly N (for example, if you just want to check that there are no duplicates), then there is a quadratic solution: use the KMP algorithm for every subsequence.
Let's assume that there are k elements in the whole sequence.
For every subsequence of length N (O(k) of them):
Build its failure function (takes O(N))
Search for it in the remainder of the sequence (takes O(k))
So, assuming N << k, the whole algorithm is indeed O(k^2).
Since your list is unordered, you're going to have to visit every item at least once.
What I'm thinking is that you first go through your list and create a dictionary where you store the number as a key along with all the indices it appears in your sequence. Like:
Key: Indices
1: 0
2: 1
3: 2, 8
....
Where the number 1 appears at index 0, the number 2 appears at index 1, the number 3 appears at indices 2 and 8, and so on.
With that created you can then go through the dictionary keys and start comparing it against the sequences at the other locations. This should save on some of the brute force since you don't have to revisit each number through the initial sequence each time.

How to find subsequences?

I'm given a random sequence of numbers of random length, consisting of 0's, 1's, 2's:
201102220221
I'm also given a number: either 1 or 2. I can only loop through the sequence once, and I need to identify all the subsequences of the number I'm given, run that value through another function and add the value to a sum (initialized to 0).
Any 0's can be replaced with a 1 or a 2. Subsequences can therefore be "expanded" by replacing 0's beside them with the number. If a subsequence cannot be expanded to a length of at least 4, I need to discard it.
For example, say I'm given the sequence:
11011002102
and the number 1. I identify the subsequence of length 2 at the beginning (see first element). It can be expanded to a subsequence of length 7 by replacing the three zeros before the first 2 with 1's. Therefore I run its current length through a function and add it to the sum.
sum += function(2);
I then identify the next subsequence of 1's (See fourth element). It's currently of size 2 and can be expanded to a maximum size of 7, by replacing the zeros around it. So I pass its length to a function and add it to the sum.
sum += function (2);
I finally identify the last subsequence of 1's (see ninth element). It currently has a length of 1 and can be expanded to a maximum size of 2, by replacing the zero beside it with a 1, which is less than 4, so I discard it.
Could someone help me write a function that does what was described above by only looping through the sequence once. Please do not give me real code, just ideas, suggestions or pseudocode.
I don't have any work to share, I'm completely lost. I also know no algorithms, so please don't start talking about things like dynamic programming or linear programming, but instead explain possible ways to approach the problem.
Given the requirement that you can only loop through the sequence once, you know the basic structure of the code.
Given the parameters for discarding a subsequence (expanded length of 4 or more) and for processing a subsequence (unexpanded sequence length), you know what data you need to track along the way. Work out how to best store this data according to your environment and language conventions.
At each iteration of the loop, consider the current character of the input series and how it affects the data stored.
I've tried to clarify the question here without just handing you the solution. Feel free to ask more questions.
Edit:
Consider how you would break the problem down step by step. Here are the iterations of your for loop:
1----------
Okay, we're looking for 1s and we've found one straight up.
-1---------
Cool, another 1, now our subsequence length has increased to 2
--0--------
Right, so as this is not a 1, the length of this subsequence is now known to be 2 - it doesn't increase any more, but as this is a 0, it might still qualify if it can expand to at least 4. The expanded subsequence length is now 3.
---1-------
The expanded subsequence length is now 4! This means we can add the last sequence length to the total sum after passing it through function. This is also the start of another subsequence, so subsequence length is now reset to 1, but expanded subsequence length is still valid at 4. That means that this subsequence is already long enough not to be rejected, but we haven't finished counting its length at this stage.
----1------
subsequence length = 2, and expanded subsequence length = 5
-----0-----
This marks the end of the 2nd subsequence, process it as before. Etc
------0----
-------2--- <- expanded subsequence length gets reset here
--------1-- <- start of another subsequence
---------0- <- expanded length is 2
----------2 <- expanded length is not long enough for this to qualify, discard it
So, fairly straightforward. There are two factors we need to keep track of: subsequence length and expanded subsequence length.
Once you've got that working, think about what happens for this input sequence "1010101".
Forget the computer for a bit; think of it as a faster version of using pencil and paper.
Try to imagine how you would solve this problem as you traverse each element of the sequence; what you might want to write down and/or edit on your piece of paper at each iteration (each element you reach in the sequence).
For example:
Sequence = 11011002102
Index 0:
Value is 1
Current is 1, Previous was null
Tracking a subsequence of 1's starting at 0 => [1,0] = 1
Index 1:
Value is 1
Current is 1, Previous was 1
Current == Previous so the subsequence length is increased by 1 => [1,0] = 2
...
