How to find subsequences? - algorithm

I'm given a random sequence of numbers of random length, consisting of 0's, 1's, 2's:
201102220221
I'm also given a number: either 1 or 2. I can only loop through the sequence once, and I need to identify all the subsequences of the number I'm given, run that value through another function and add the value to a sum (initialized to 0).
Any 0's can be replaced with a 1 or a 2. Subsequences can therefore be "expanded", by replacing 0's beside it with the number. If a subsequence cannot be expanded to a minimum length of more than 4, I need to discard it.
For example, say I'm given the sequence:
11011002102
and the number 1. I identify the subsequence of length 2 at the beginning (See first element). It can be expanded to a subsequence of length 7, by replacing the first 0, third and fourth zero with 1's. Therefore I run its current length through a function and add it to the sum.
sum += function(2);
I then identify the next subsequence of 1's (See fourth element). It's currently of size 2 and can be expanded to a maximum size of 7, by replacing the zeros around it. So I pass its length to a function and add it to the sum.
sum += function (2);
I finally identify the last subsequence of 1's (See sixth element). It currently has a length of 1 and can be expanded to a maximum size of 2, by replacing the zero beside it with a 1, which is less than 4, so I discard it.
Could someone help me write a function that does what was described above by only looping through the sequence once. Please do not give me real code, just ideas, suggestions or pseudocode.
I don't have any work to share, I'm completely lost. I also know no algorithms, so please don't start talking about things like dynamic programming or linear programming, but instead explain possible ways to approach the problem.

Given the requirement that you can only loop through the sequence once, you know the basic structure of the code.
Given the parameters for discarding a subsequence (expanded length of 4 or more) and for processing a subsequence (unexpanded sequence length), you know what data you need to track along the way. Work out how to best store this data according to your environment and language conventions.
At each iteration of the loop, consider the current character of the input series and how it affects the data stored.
I've tried to clarify the question here without just handing you the solution. Feel free to ask more questions.
Edit:
Consider how you would break the problem down step by step. Here are the iterations of your for loop:
1----------
Okay, we're looking for 1s and we've found one straight up.
-1---------
Cool, another 1, now our subsequence length has increased to 2
--0--------
Right, so as this is not a 1, the length of this subsequence is now known to be 2 - it doesn't increase any more, but as this is a 0, it might still qualify if it can expand to at least 4. The expanded subsequence length is now 3.
---1-------
The expanded subsequence length is now 4! This means we can add the last sequence length to the total sum after passing it through function. This is also the start of another subsequence - so subsequence length is now reset to 1, but expanded subsequence length is still valid at 4. That means that this subsequence is already long enough not to be rejected, but we haven't finished counting it's length at this stage.
----1------
subsequence length = 2, and expanded subsequence length = 5
-----0-----
This marks the end of the 2nd subsequence, process it as before. Etc
------0----
-------2--- <- expanded subsequence length gets reset here
--------1-- <- start of another subsequence
---------0- <- expanded length is 2
----------2 <- expanded length is not long enough for this to qualify, discard it
So, fairly straight forward. There are two factors we need to keep track of: subsequence length and expanded subsequence length.
Once you've got that working, think about what happens for this input sequence "1010101".

Forget the computer for a bit; think of it as a faster version of using pencil and paper.
Try to imagine how you would solve this problem as you traverse each element of the sequence; what you might want to write down and/or edit on your piece of paper at each iteration (each element you reach in the sequence).
For example:
Sequence = 11011002102
Index 0:
Value is 1
Current is 1, Previous was null
Tracking a subsequence of 1's starting at 0 => [1,0] = 1
Index 1:
Value is 1
Current is 1, Previous was 1
Current == Previous so the subsequence length is increased by 1 => [1,0] = 2
...

Related

Minimum operations to make K Non Decreasing Array [duplicate]

We know about an algorithm that will find the Longest Increasing subsequence in O(nlogn). I was wondering whether we can find the Longest non-decreasing subsequence with similar time complexity?
For example, consider an array : (4,10,4,8,9).
The longest increasing subsequence is (4,8,9).
And a longest non-decreasing subsequence would be (4,4,8,9).
First, here’s a “black box” approach that will let you find the longest nondecreasing subsequence using an off-the-shelf solver for longest increasing subsequences. Let’s take your sample array:
4, 10, 4, 8, 9
Now, imagine we transformed this array as follows by adding a tiny fraction to each number:
4.0, 10.1, 4.2, 8.3, 9.4
Changing the numbers this way will not change the results of any comparisons between two different integers, since the integer components have a larger magnitude difference than the values after the decimal point. However, if you compare the two 4s now, the latter 4 compares bigger than the previous one. If you now find the longest nondecreasing subsequence, you get back [4.0, 4.2, 8.3, 9.4], which you can then map back to [4, 4, 8, 9].
More generally, if you’re working with an array of n integer values, you can add i / n to each of the numbers, where i is its index, and you’ll be left with a sequence of distinct numbers. From there running a regular LIS algorithm will do the trick.
If you can’t work with fractions this way, you could alternatively multiply each number by n and then add in i, which also works.
On the other hand, suppose you have the code for a solver for LIS and want to convert it to one that solves the longest nondecreasing subsequence problem. The reasoning above shows that if you treat later copies of numbers as being “larger” than earlier copies, then you can just use a regular LIS. Given that, just read over the code for LIS and find spots where comparisons are made. When a comparison is made between two equal values, break the tie by considering the later appearance to be bigger than the earlier one.
I think the following will work in O(nlogn):
Scan the array from right to left, and for each element solve a subproblem of finding a longest subsequence starting from the given element of the array. E.g. if your array has indices from 0 to 4, then you start with the subarray [4,4] and check what's the longest sequence starting from 4, then you check subarray [3,4] and what's the longest subsequence starting from 3, next [2,4], and so on, until [0,4]. Finally, you choose the longest subsequence established in either of the steps.
For the last element (so subarray [4,4]) the longest sequence is always of length 1.
When in the next iteration you consider another element to the left (e.g., in the second step you consider the subarray [3,4], so the new element is element with the index 3 in the original array) you check if that element is not greater than some of the elements to its right. If so, you can take the result for some element from the right and add one.
For instance:
[4,4] -> longest sequence of length 1 (9)
[3,4] -> longest sequence of length 2 (8,9) 1+1 (you take the longest sequence from above which starts with 9 and add one to its length)
[2,4] -> longest sequence of length 3 (4,8,9) 2+1 (you take the longest sequence from above, i.e. (8,9), and add one to its length)
[1,4] -> longest sequence of length 1 (10) nothing to add to (10 is greater than all the elements to its right)
[0,4] -> longest sequence of length 4 (4,4,8,9) 3+1 (you take the longest sequence above, i.e. (4,8,9), and add one to its length)
The main issue is how to browse all the candidates to the right in logarithmic time. For that you keep a sorted map (a balanced binary tree). The keys are the already visited elements of the array. The values are the longest sequence lengths obtainable from that element. No need to store duplicates - among duplicate keys store the entry with largest value.

Check for duplicate subsequences of length >= N in sequence

I have a sequence of values and I want to know if it contains an repeated subsequences of a certain minimum length. For instance:
1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101
Contains the subsequence 3, 4, 5, 100 twice. It also contains the subsequence 99, 101 twice, but that subsequence is two short to care about.
Is there an efficient algorithm for checking the existence of such a subsequence? I'm not especially interested in location the sequences (though that would be helpful for verification), I'm primarily just interested in a True/False answer, given a sequence and a minimum subsequence length.
My only approach so far is to brute force search it: for each item in the sequence, find all the other locations where the item occurs (already at O(N^2)), and then walk forward one step at a time from each location and see if the next item matches, and keep going until I find a mismatch or find a matching subsequence of sufficient length.
Another thought I had but haven't been able to develop into an actual approach is to build a tree of all the sequences, so that each number is a node, and a child of its the number that preceded it, whereever that node happens to already be in the tree.
There are O(k) solutions (k - the length of the whole sequence) for any value of N.
Solution #1: Build a suffix tree for the input sequence(using Ukkonen's algorithm). Iterate over the nodes with two or more children and check if at least one of them has depth >= N.
Solution #2: Build a suffix automaton for the input sequence.Iterate over all the states which right context contains at least two different strings and check if at least one of those nodes has distance >= N from the initial state of the automaton.
Solution #3:Suffix array and the longest common prefix technique can also be used(build the suffix array for input sequence , compute the longest common prefix array, check that there is a pair of adjacent suffices with common prefix with length at least N).
These solutions have O(k) time complexity under the assumption that alphabet size is constant(alphabet consists of all elements of the input sequence).
If it is not the case, it is still possible to obtain O(k log k) worst case time complexity(by storing all transitions in a tree or in an automaton in a map) or O(k) on average using hashmap.
P.S I use terms string and sequence interchangeably here.
If you only care about subsequences of length exactly N (for example, if just want to check that there are no duplicates), then there is a quadratic solution: use the KMP algorithm for every subsequence.
Let's assume that there are k elements in the whole sequence.
For every subsequence of length N (O(k) of them):
Build its failure function (takes O(N))
Search for it in the remainder of the sequence (takes O(k))
So, assuming N << k, the whole algorithm is indeed O(k^2).
Since your list is unordered, you're going to have to visit every item at least once.
What I'm thinking is that you first go through your list and create a dictionary where you store the number as a key along with all the indices it appears in your sequence. Like:
Key: Indices
1: 0
2: 1
3: 2, 8
....
Where the number 1 appears at index 0, the number 2 appears at index 1, the number 3 appears at indices 2 and 8, and so on.
With that created you can then go through the dictionary keys and start comparing it against the sequences at the other locations. This should save on some of the brute force since you don't have to revisit each number through the initial sequence each time.

The minimum number of "insertions" to sort an array

Suppose there is an unordered list. The only operation we can do is to move an element and insert it back to any place. How many moves does it take to sort the whole list?
I guess the answer is size of the list - size of longest ordered sequence, but I have no idea how to prove it.
First note that moving an element doesn't change relative order of elements other than the one being moved.
Consider the longest non-decreasing subsequence (closely related to the longest increasing subsequence - the way to find them are similar).
By only moving the element not in this sequence, it's easy to see that we'd end up with a sorted list, since all the elements in this sequence are already sorted relative to each other.
If we don't move any elements in this sequence, any other element between two elements in this subsequence is guaranteed to be either greater than the larger element, or smaller than the smaller one (if this is not true, it itself would be in the longest sequence), so it needs to be moved.
(see below for example)
Does it need to be non-decreasing? Yes. Consider if two consecutive elements in this sequence are decreasing. In this case it would be impossible to sort the list without moving those two elements.
To minimize the number of moves required, it's sufficient to pick the longest sequence possible, as done above.
So the total number of moves required is the size of the list minus the size of the longest non-decreasing subsequence.
An example explaining the value of an element not in the non-decreasing subsequence mentioned above:
Consider the longest non-decreasing subsequence 1 x x 2 x x 2 x 4 (the x's are some elements not part of the sequence).
Now consider the possible values for an x between 2 and 4.
If it's 2, 3 or 4, the longest subsequence would include that element. If it's greater than 4 or smaller than 2, it needs to be moved.
It is easy to prove that size of the list - size of longest ordered sequence is enough always to sort any sequence, e.g. with mathematical induction.
You can also easily prove that for some unordered sequences, it is the best what you can do by simply finding such trivial sequence. E.g. to sort the sequence 3, 1, 2 you need one move of one item (3) and it's trivial to see that it cannot be made faster, and obviously size of the list - size of longest ordered sequence is equal to 1.
However, proving that it is the best for all sequences is impossible because that statement is not actually true: A counter example is a sequence with multiple sorted sub-sequences S[i], where max(S[i]) < min(S[i+1]) for every i. For example see the sequence 1, 2, 3, 1000, 4, 5, 6.

Find if any permutation of a number is within a range

I need to find if any permutation of the number exists within a specified range, i just need to return Yes or No.
For eg : Number = 122, and Range = [200, 250]. The answer would be Yes, as 221 exists within the range.
PS:
For the problem that i have in hand, the number to be searched
will only have two different digits (It will only contain 1 and 2,
Eg : 1112221121).
This is not a homework question. It was asked in an interview.
The approach I suggested was to find all permutations of the given number and check. Or loop through the range and check if we find any permutation of the number.
Checking every permutation is too expensive and unnecessary.
First, you need to look at them as strings, not numbers,
Consider each digit position as a seperate variable.
Consider how the set of possible digits each variable can hold is restricted by the range. Each digit/variable pair will be either (a) always valid (b) always invalid; or (c) its validity is conditionally dependent on specific other variables.
Now model these dependencies and independencies as a graph. As case (c) is rare, it will be easy to search in time proportional to O(10N) = O(N)
Numbers have a great property which I think can help you here:
For a given number a of value KXXXX, where K is given, we can
deduce that K0000 <= a < K9999.
Using this property, we can try to build a permutation which is within the range:
Let's take your example:
Range = [200, 250]
Number = 122
First, we can define that the first number must be 2. We have two 2's so we are good so far.
The second number must be be between 0 and 5. We have two candidate, 1 and 2. Still not bad.
Let's check the first value 1:
Any number would be good here, and we still have an unused 2. We have found our permutation (212) and therefor the answer is Yes.
If we did find a contradiction with the value 1, we need to backtrack and try the value 2 and so on.
If none of the solutions are valid, return No.
This Algorithm can be implemented using backtracking and should be very efficient since you only have 2 values to test on each position.
The complexity of this algorithm is 2^l where l is the number of elements.
You could try to implement some kind of binary search:
If you have 6 ones and 4 twos in your number, then first you have the interval
[1111112222; 2222111111]
If your range does not overlap with this interval, you are finished. Now split this interval in the middle, you get
(1111112222 + 222211111) / 2
Now find the largest number consisting of 1's and 2's of the respective number that is smaller than the split point. (Probably this step could be improved by calculating the split directly in some efficient way based on the 1 and 2 or by interpreting 1 and 2 as 0 and 1 of a binary number. One could also consider taking the geometric mean of the two numbers, as the candidates might then be more evenly distributed between left and right.)
[Edit: I think I've got it: Suppose the bounds have the form pq and pr (i.e. p is a common prefix), then build from q and r a symmetric string s with the 1's at the beginning and the end of the string and the 2's in the middle and take ps as the split point (so from 1111112222 and 1122221111 you would build 111122222211, prefix is p=11).]
If this number is contained in the range, you are finished.
If not, look whether the range is above or below and repeat with [old lower bound;split] or [split;old upper bound].
Suppose the range given to you is: ABC and DEF (each character is a digit).
Algorithm permutationExists(range_start, range_end, range_index, nos1, nos2)
if (nos1>0 AND range_start[range_index] < 1 < range_end[range_index] and
permutationExists(range_start, range_end, range_index+1, nos1-1, nos2))
return true
elif (nos2>0 AND range_start[range_index] < 2 < range_end[range_index] and
permutationExists(range_start, range_end, range_index+1, nos1, nos2-1))
return true
else
return false
I am assuming every single number to be a series of digits. The given number is represented as {numberOf1s, numberOf2s}. I am trying to fit the digits (first 1s and then 2s) within the range, if not the procudure returns a false.
PS: I might be really wrong. I dont know if this sort of thing can work. I haven't given it much thought, really..
UPDATE
I am wrong in the way I express the algorithm. There are a few changes that need to be done in it. Here is a working code (It worked for most of my test cases): http://ideone.com/1aOa4
You really only need to check at most TWO of the possible permutations.
Suppose your input number contains only the digits X and Y, with X<Y. In your example, X=1 and Y=2. I'll ignore all the special cases where you've run out of one digit or the other.
Phase 1: Handle the common prefix.
Let A be the first digit in the lower bound of the range, and let B be the first digit in the upper bound of the range. If A<B, then we are done with Phase 1 and move on to Phase 2.
Otherwise, A=B. If X=A=B, then use X as the first digit of the permutation and repeat Phase 1 on the next digit. If Y=A=B, then use Y as the first digit of the permutation and repeat Phase 1 on the next digit.
If neither X nor Y is equal to A and B, then stop. The answer is No.
Phase 2: Done with the common prefix.
At this point, A<B. If A<X<B, then use X as the first digit of the permutation and fill in the remaining digits however you want. The answer is Yes. (And similarly if A<Y<B.)
Otherwise, check the following four cases. At most two of the cases will require real work.
If A=X, then try using X as the first digit of the permutation, followed by all the Y's, followed by the rest of the X's. In other words, make the rest of the permutation as large as possible. If this permutation is in range, then the answer is Yes. If this permutation is not in range, then no permutation starting with X can succeed.
If B=X, then try using X as the first digit of the permutation, followed by the rest of the X's, followed by all the Y's. In other words, make the rest of the permutation as small as possible. If this permutation is in range, then the answer is Yes. If this permutation is not in range, then no permutation starting with X can succeed.
Similar cases if A=Y or B=Y.
If none of these four cases succeed, then the answer is No. Notice that at most one of the X cases and at most one of the Y cases can match.
In this solution, I've assumed that the input number and the two numbers in the range all contain the same number of digits. With a little extra work, the approach can be extended to cases where the numbers of digits differ.

string transposition algorithm

Suppose there is given two String:
String s1= "MARTHA"
String s2= "MARHTA"
here we exchange positions of T and H. I am interested to write code which counts how many changes are necessary to transform from one String to another String.
There are several edit distance algorithms, the given Wikipeida link has links to a few.
Assuming that the distance counts only swaps, here is an idea based on permutations, that runs in linear time.
The first step of the algorithm is ensuring that the two strings are really equivalent in their character contents. This can be done in linear time using a hash table (or a fixed array that covers all the alphabet). If they are not, then s2 can't be considered a permutation of s1, and the "swap count" is irrelevant.
The second step counts the minimum number of swaps required to transform s2 to s1. This can be done by inspecting the permutation p that corresponds to the transformation from s1 to s2. For example, if s1="abcde" and s2="badce", then p=(2,1,4,3,5), meaning that position 1 contains element #2, position 2 contains element #1, etc. This permutation can be broke up into permutation cycles in linear time. The cycles in the example are (2,1) (4,3) and (5). The minimum swap count is the total count of the swaps required per cycle. A cycle of length k requires k-1 swaps in order to "fix it". Therefore, The number of swaps is N-C, where N is the string length and C is the number of cycles. In our example, the result is 2 (swap 1,2 and then 3,4).
Now, there are two problems here, and I think I'm too tired to solve them right now :)
1) My solution assumes that no character is repeated, which is not always the case. Some adjustment is needed to calculate the swap count correctly.
2) My formula #MinSwaps=N-C needs a proof... I didn't find it in the web.
Your problem is not so easy, since before counting the swaps you need to ensure that every swap reduces the "distance" (in equality) between these two strings. Then actually you look for the count but you should look for the smallest count (or at least I suppose), otherwise there exists infinite ways to swap a string to obtain another one.
You should first check which charaters are already in place, then for every character that is not look if there is a couple that can be swapped so that the next distance between strings is reduced. Then iterate over until you finish the process.
If you don't want to effectively do it but just count the number of swaps use a bit array in which you have 1 for every well-placed character and 0 otherwise. You will finish when every bit is 1.

Resources