Longest substring in which all bits are the same (DP algorithm) - algorithm

You're given a bit string that consists of 1 <= n <= 32 bits.
You are also given a sequence of changes, each of which inverts one bit. If the original string is 001011 and the change is "3", then the changed bit string would be 000011 (the 3rd bit from the left was flipped).
After each change, I have to find the length of the longest substring in which all bits are the same. So for 000011, the answer would be 4.
I think brute force would just be a sliding window that starts at the size of the string and shrinks until the first instance where all the bits in the window are the same.
How would that be altered for a dynamic programming solution?

You can solve this by maintaining a list of indices at which the bit flips. You start by creating that list: shift the sequence left by one bit (dropping the first bit) and compare it to the original. Any mismatch is a bit flip. Make a list of those indices:
001011
01011
-234--
In this case, you have bit flips after locations 2, 3, and 4.
Now you need a simple function to process your change operation: for a change of bit n, toggle whether indices n-1 and n are in the list. If an index is not in the list, add it; if it is, remove it. In the case of changing bit 3, both 2 and 3 are in the list, so you now remove them:
---4--
Any time you want to check the longest substring, you need merely check adjacent indices for the largest difference, including 0 and the string length as endpoints. Thus, when you have the list [0, 2, 3, 4, 6], you have a maximum difference of 2 (at 2-0 and 6-4). After the change, with the list [0, 4, 6], you have a maximum difference of 4 (at 4-0).
If you have a large list with many indices, you can simply maintain differences, altering only the adjacent intervals affected by the single change.
This should get you going; I leave the implementation details to the student. :-)
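For what it's worth, here is a minimal Python sketch of this flip-list idea, assuming 1-based bit positions counted from the left (all names are illustrative, not part of the original answer):

def build_flips(bits):
    # Boundary i means bits[i-1] != bits[i], i.e. a flip "after location i".
    return {i for i in range(1, len(bits)) if bits[i - 1] != bits[i]}

def apply_change(flips, n, length):
    # Flipping bit n toggles the two boundaries that touch it.
    for b in (n - 1, n):
        if 1 <= b < length:        # 0 and length are sentinels, never stored
            if b in flips:
                flips.remove(b)    # boundary disappears: two runs merge
            else:
                flips.add(b)       # boundary appears: a run splits

def longest_run(flips, length):
    # Largest gap between adjacent boundaries, with 0 and length included.
    points = sorted({0, length} | flips)
    return max(b - a for a, b in zip(points, points[1:]))

bits = "001011"
flips = build_flips(bits)              # {2, 3, 4}
apply_change(flips, 3, len(bits))      # 001011 -> 000011; flips -> {4}
print(longest_run(flips, len(bits)))   # 4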

Related

Algorithm for seeing if many different arrays are subsets of another one?

Let's say I have an array of ~20-100 integers, for example [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (actually numbers more like [106511349, 173316561, ...], all nonnegative 64-bit integers under 2^63, but for demonstration purposes let's use these).
And many (~50,000) smaller arrays of usually 1-20 terms to match or not match:
1=[2, 3, 8, 20]
2=[2, 3, NOT 8]
3=[2, 8, NOT 16]
4=[2, 8, NOT 16] (there will be duplicates with different list IDs)
I need to find which of these are subsets of the array being tested. A matching list must have all of the positive matches, and none of the negative ones. So for this small example, I would need to get back something like [3, 4]. List 1 fails to match because it requires 20, and list 2 fails to match because it has NOT 8. The NOT can easily be represented by using the high bit/making the number negative in those cases.
I need to do this quickly, up to 10,000 times per second. The small arrays are "fixed" (they change infrequently, like once every few seconds), while a different large array arrives with each data item to be scanned (so 10,000 different large arrays per second).
This has become a bit of a bottleneck, so I'm looking into ways to optimize it.
I'm not sure the best data structures or ways to represent this. One solution would be to turn it around and see what small lists we even need to consider:
2=[1, 2, 3, 4]
3=[1, 2]
8=[1, 2, 3, 4]
16=[3, 4]
20=[1]
Then we'd build up a list of lists to check, and do the full subset matching on these. However, certain terms (often the more frequent ones) are going to end up in many of the lists, so there's not much of an actual win here.
I was wondering if anyone is aware of a better algorithm for solving this sort of problem?
You could try to build a tree from the smaller arrays, since they change less frequently, such that each subtree tries to halve the number of small arrays left to consider.
For example, do frequency analysis on the numbers in the smaller arrays. Find which number occurs in closest to half of the smaller arrays, and make that the first check in the tree. In your example that would be '3', since it occurs in half the small arrays. That becomes the root node of the tree. Now put all the small lists that contain 3 into the left subtree and all the other lists into the right subtree, and repeat this process recursively on each subtree. Then, when a large array comes in, reverse-index it and traverse the tree to get the candidate lists.
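As an illustration only, here is one way such a tree could look in Python. This sketch ignores the NOT terms for brevity and uses made-up names throughout:

from collections import Counter

def build(filters):
    # filters: list of (filter_id, set_of_required_values)
    if len(filters) <= 2:
        return ("leaf", filters)
    counts = Counter(x for _, f in filters for x in f)
    # Pick the value whose frequency is closest to half the filters.
    e = min(counts, key=lambda x: abs(counts[x] - len(filters) / 2))
    with_e = [(fid, f) for fid, f in filters if e in f]
    without = [(fid, f) for fid, f in filters if e not in f]
    if not with_e or not without:      # cannot split any further
        return ("leaf", filters)
    return ("node", e, build(with_e), build(without))

def query(tree, present):
    if tree[0] == "leaf":
        return [fid for fid, f in tree[1] if f <= present]
    _, e, with_e, without = tree
    out = query(without, present)
    if e in present:                   # filters requiring e are only viable
        out += query(with_e, present)  # when e is in the input
    return out

filters = [(1, {2, 3, 8, 20}), (2, {2, 3}), (3, {2, 8}), (4, {2, 8})]
tree = build(filters)
print(sorted(query(tree, set(range(10)))))   # [2, 3, 4] (2 matches only because NOT is ignored here)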
You did not state which of your arrays are sorted - if any.
Since your data is not that big, I would use a hash map to store the entries of the source set (the one with ~20-100 integers). That would basically let you test whether an integer is present in O(1).
Then, given that 50,000 (arrays) * 20 (terms each) * 8 (bytes per term) = 8 megabytes plus hash-map overhead, which does not seem large either for most systems, I would use another hash map to store already-tested arrays. This way you don't have to re-test duplicates.
I realize this may be less satisfying from a CS point of view, but if you're doing a huge number of tiny tasks that don't affect each other, you might want to consider parallelizing them (multithreading). 10,000 tasks per second, comparing a different array in each task, should fit the bill; you don't give any details about what else you're doing (e.g., where all these arrays are coming from), but it's conceivable that multithreading could improve your throughput by a large factor.
First, do what you were suggesting: make a hashmap from input integer to the IDs of the filter arrays it appears in. That lets you say "input #27 is in these 400 filters", and toss those 400 into a sorted set. You then have to do an intersection of the sorted sets for each one.
Optional: make a second hashmap from each input integer to its frequency in the set of filters. When an input comes in, sort its integers using the second hashmap. Then start with the least common input integer, so you have less overall work to do on each step. Also compute the frequencies for the "NOT" cases, so you basically get the most bang for your buck on each step.
Finally: this could be pretty easily made into a parallel programming problem; if it's not fast enough on one machine, it seems you could put more machines on it pretty easily, if whatever it's returning is useful enough.
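A rough sketch of the hashmap approach from the answer above, using per-filter hit counts rather than explicit sorted-set intersections (all names here are illustrative):

from collections import defaultdict

# filter id -> (required values, forbidden "NOT" values)
filters = {
    1: ({2, 3, 8, 20}, set()),
    2: ({2, 3}, {8}),
    3: ({2, 8}, {16}),
    4: ({2, 8}, {16}),
}

index = defaultdict(set)           # value -> filter IDs that require it
for fid, (required, _) in filters.items():
    for value in required:
        index[value].add(fid)

def matching_filters(large_array):
    present = set(large_array)
    hits = defaultdict(int)        # filter id -> number of required values seen
    for value in present:
        for fid in index.get(value, ()):
            hits[fid] += 1
    return [fid for fid, n in hits.items()
            if n == len(filters[fid][0])           # all required values present
            and not (filters[fid][1] & present)]   # no forbidden value present

print(sorted(matching_filters([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])))   # [3, 4]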

canceling arrays by number of items that I am ready to lose

We are writing a C# program that will help us remove some unnecessary data repeaters. We already found some repeaters to remove with the help of this: Finding overlapping data in arrays. Now we want to check whether we can cancel some repeaters by another criterion. The question is:
We have arrays of numbers
{1, 2, 3, 4, 5, 6, 7, ...}, {4, 5, 10, 100}, {100, 1, 20, 50}
Some numbers can be repeated across arrays; some numbers are unique and belong to only one specific array. We want to remove some arrays when we are ready to lose up to N numbers from them.
Explanation:
{1, 2}
{2, 3, 4, 5}
{2, 7}
If we are ready to lose up to 3 numbers from these arrays, we can remove array 1, because we would lose only the number "1" (its only unique number). We could also remove arrays 1 and 3, because we would lose only the numbers "1" and "7", or just array 3, losing only the number "7"; in each case that is no more than 3 numbers.
As output, we want the maximum number of arrays that can be removed, under the condition that we lose at most N numbers, where N is the number of items we are ready to lose.
This problem is equivalent to the Set Cover problem (e.g., take N = 0), and thus efficient, exact solutions that work in general are unlikely. However, in practice, heuristics and approximations are often good enough. Given the similarity of your problem to Set Cover, the greedy heuristic is a natural starting point: instead of stopping when you've covered all elements, stop when you've covered all but N elements.
You first need to compute, for each array, a number that tells you how many of its numbers are unique to that particular array.
An easy way to do this is O(n²) since for each element, you need to check through all arrays if it's unique.
You can do this much more efficiently by having sorted arrays, sorting first or using a heap-like data structure.
After that, you only have to pick a set of arrays whose unique-number counts sum to at most N. That's similar to the subset sum problem, but much less complex because N > 0 and all your numbers are > 0.
So you simply sort these numbers from smallest to greatest and then iterate over the sorted array, taking numbers as long as the sum stays within N.
Finally, you can remove every array that corresponds to a number which you were able to fit into N.
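A minimal sketch of this greedy idea (illustrative names; note that, as described above, it only counts numbers unique to a single array):

from collections import Counter

def removable_arrays(arrays, n):
    # How many arrays does each number appear in?
    counts = Counter(x for arr in arrays for x in set(arr))
    # Cost of removing an array = how many of its numbers occur nowhere else.
    costs = sorted((sum(1 for x in set(arr) if counts[x] == 1), i)
                   for i, arr in enumerate(arrays))
    removed, lost = [], 0
    for cost, i in costs:
        if lost + cost > n:        # would lose more than we can afford
            break
        lost += cost
        removed.append(i)
    return removed

arrays = [[1, 2], [2, 3, 4, 5], [2, 7]]
print(removable_arrays(arrays, 3))   # [0, 2]: lose only {1} and {7}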

Check for duplicate subsequences of length >= N in sequence

I have a sequence of values and I want to know if it contains any repeated subsequences of a certain minimum length. For instance:
1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101
Contains the subsequence 3, 4, 5, 100 twice. It also contains the subsequence 99, 101 twice, but that subsequence is too short to care about.
Is there an efficient algorithm for checking the existence of such a subsequence? I'm not especially interested in locating the sequences (though that would be helpful for verification); I'm primarily just interested in a True/False answer, given a sequence and a minimum subsequence length.
My only approach so far is to brute force search it: for each item in the sequence, find all the other locations where the item occurs (already at O(N^2)), and then walk forward one step at a time from each location and see if the next item matches, and keep going until I find a mismatch or find a matching subsequence of sufficient length.
Another thought I had, but haven't been able to develop into an actual approach, is to build a tree of all the sequences, so that each number is a node and a child of the number that preceded it, wherever that node already happens to be in the tree.
There are O(k) solutions (where k is the length of the whole sequence) for any value of N.
Solution #1: Build a suffix tree for the input sequence (using Ukkonen's algorithm). Iterate over the nodes with two or more children and check if at least one of them has depth >= N (measured in characters from the root).
Solution #2: Build a suffix automaton for the input sequence. Iterate over all the states whose right context contains at least two different strings and check if at least one of those states has distance >= N from the initial state of the automaton.
Solution #3: A suffix array and the longest-common-prefix technique can also be used (build the suffix array for the input sequence, compute the longest common prefix array, and check whether there is a pair of adjacent suffixes with a common prefix of length at least N).
These solutions have O(k) time complexity under the assumption that the alphabet size is constant (the alphabet consists of all elements of the input sequence).
If that is not the case, it is still possible to obtain O(k log k) worst-case time complexity (by storing all transitions in the tree or the automaton in a map) or O(k) on average using a hash map.
P.S. I use the terms string and sequence interchangeably here.
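Those linear-time constructions take real effort to implement; purely for illustration, here is a simple quadratic sketch of solution #3's adjacent-suffix check, sorting whole suffixes instead of building a proper suffix array:

def has_repeat(seq, n):
    # The longest repeated prefix always appears between two suffixes
    # that are adjacent in sorted order.
    suffixes = sorted(tuple(seq[i:]) for i in range(len(seq)))
    for a, b in zip(suffixes, suffixes[1:]):
        common = 0
        for x, y in zip(a, b):
            if x != y:
                break
            common += 1
            if common >= n:
                return True
    return False

seq = [1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101]
print(has_repeat(seq, 4))   # True: 3, 4, 5, 100 appears twice
print(has_repeat(seq, 5))   # False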
If you only care about subsequences of length exactly N (for example, if you just want to check that there are no duplicates), then there is a quadratic solution: use the KMP algorithm for every length-N subsequence.
Let's assume that there are k elements in the whole sequence.
For every subsequence of length N (O(k) of them):
Build its failure function (takes O(N))
Search for it in the remainder of the sequence (takes O(k))
So, assuming N << k, the whole algorithm is indeed O(k^2).
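As an aside, for the fixed-length case a set of window tuples gives the same yes/no answer without implementing KMP; a sketch (O(k*N) time and memory, with hypothetical names):

def has_duplicate_window(seq, n):
    seen = set()
    for i in range(len(seq) - n + 1):
        window = tuple(seq[i:i + n])   # hashable snapshot of the window
        if window in seen:
            return True
        seen.add(window)
    return False

seq = [1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101]
print(has_duplicate_window(seq, 4))   # True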
Since your list is unordered, you're going to have to visit every item at least once.
What I'm thinking is that you first go through your list and create a dictionary where you store each number as a key along with all the indices at which it appears in your sequence. Like:
Key: Indices
1: 0
2: 1
3: 2, 8
....
Where the number 1 appears at index 0, the number 2 appears at index 1, the number 3 appears at indices 2 and 8, and so on.
With that created, you can then go through the dictionary keys and compare the subsequences starting at each of a number's locations against one another. This should save on some of the brute force, since you don't have to revisit each number throughout the initial sequence each time.
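A small sketch of that idea (illustrative names): build the index first, then extend matches only from positions that share a value:

from collections import defaultdict

def has_repeat_indexed(seq, n):
    positions = defaultdict(list)        # value -> indices where it occurs
    for i, value in enumerate(seq):
        positions[value].append(i)
    for idxs in positions.values():
        for j, a in enumerate(idxs):
            for b in idxs[j + 1:]:       # only compare matching anchors
                length = 0
                while b + length < len(seq) and seq[a + length] == seq[b + length]:
                    length += 1
                    if length >= n:
                        return True
    return False

seq = [1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101]
print(has_repeat_indexed(seq, 4))   # True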

How to find subsequences?

I'm given a random sequence of numbers of random length, consisting of 0's, 1's, 2's:
201102220221
I'm also given a number: either 1 or 2. I can only loop through the sequence once, and I need to identify all the subsequences of the number I'm given, run each one's length through another function, and add the result to a sum (initialized to 0).
Any 0 can be replaced with a 1 or a 2. Subsequences can therefore be "expanded" by replacing the 0's beside them with the given number. If a subsequence cannot be expanded to a length of at least 4, I need to discard it.
For example, say I'm given the sequence:
11011002102
and the number 1. I identify the subsequence of length 2 at the beginning (see the first element). It can be expanded to a subsequence of length 7, by replacing the first, second, and third zeros with 1's. Therefore I run its current length through a function and add it to the sum.
sum += function(2);
I then identify the next subsequence of 1's (See fourth element). It's currently of size 2 and can be expanded to a maximum size of 7, by replacing the zeros around it. So I pass its length to a function and add it to the sum.
sum += function (2);
I finally identify the last subsequence of 1's (see the ninth element). It currently has a length of 1 and can be expanded to a maximum size of 2, by replacing the zero beside it with a 1; since that is less than 4, I discard it.
Could someone help me write a function that does what was described above by only looping through the sequence once? Please do not give me real code, just ideas, suggestions or pseudocode.
I don't have any work to share; I'm completely lost. I also don't know any algorithms, so please don't start talking about things like dynamic programming or linear programming; instead, explain possible ways to approach the problem.
Given the requirement that you can only loop through the sequence once, you know the basic structure of the code.
Given the parameters for discarding a subsequence (expanded length of 4 or more) and for processing a subsequence (unexpanded sequence length), you know what data you need to track along the way. Work out how to best store this data according to your environment and language conventions.
At each iteration of the loop, consider the current character of the input series and how it affects the data stored.
I've tried to clarify the question here without just handing you the solution. Feel free to ask more questions.
Edit:
Consider how you would break the problem down step by step. Here are the iterations of your for loop:
1----------
Okay, we're looking for 1s and we've found one straight up.
-1---------
Cool, another 1, now our subsequence length has increased to 2
--0--------
Right, so as this is not a 1, the length of this subsequence is now known to be 2 - it doesn't increase any more, but as this is a 0, it might still qualify if it can expand to at least 4. The expanded subsequence length is now 3.
---1-------
The expanded subsequence length is now 4! This means we can add the last sequence length to the total sum after passing it through the function. This is also the start of another subsequence - so subsequence length is now reset to 1, but expanded subsequence length is still valid at 4. That means that this subsequence is already long enough not to be rejected, but we haven't finished counting its length at this stage.
----1------
subsequence length = 2, and expanded subsequence length = 5
-----0-----
This marks the end of the 2nd subsequence, process it as before. Etc
------0----
-------2--- <- expanded subsequence length gets reset here
--------1-- <- start of another subsequence
---------0- <- expanded length is 2
----------2 <- expanded length is not long enough for this to qualify, discard it
So, fairly straightforward. There are two factors we need to keep track of: subsequence length and expanded subsequence length.
Once you've got that working, think about what happens for this input sequence "1010101".
Forget the computer for a bit; think of it as a faster version of using pencil and paper.
Try to imagine how you would solve this problem as you traverse each element of the sequence; what you might want to write down and/or edit on your piece of paper at each iteration (each element you reach in the sequence).
For example:
Sequence = 11011002102
Index 0:
Value is 1
Current is 1, Previous was null
Tracking a subsequence of 1's starting at 0 => [1,0] = 1
Index 1:
Value is 1
Current is 1, Previous was 1
Current == Previous so the subsequence length is increased by 1 => [1,0] = 2
...

Subset calculation of list of integers

I'm currently implementing an algorithm where one particular step requires me to calculate subsets in the following way.
Imagine I have sets (possibly millions of them) of integers, where each set could potentially contain around 1,000 elements:
Set1: [1, 3, 7]
Set2: [1, 5, 8, 10]
Set3: [1, 3, 11, 14, 15]
...,
Set1000000: [1, 7, 10, 19]
Imagine a particular input set:
InputSet: [1, 7]
I now want to quickly calculate of which sets this InputSet is a subset. In this particular case, it should return Set1 and Set1000000.
Now, brute-forcing it takes too much time. I could also parallelise via Map/Reduce, but I'm looking for a more intelligent solution. Also, to a certain extent, it should be memory-efficient. I already optimised the calculation by making use of Bloom filters to quickly eliminate sets of which the input set could never be a subset.
Any smart technique I'm missing out on?
Thanks!
Well - it seems that the bottleneck is the number of sets, so instead of finding a set by iterating over all of them, you could enhance performance by mapping from elements to all the sets containing them, and return the sets containing all the elements you searched for.
This is very similar to what is done in AND query when searching the inverted index in the field of information retrieval.
In your example, you will have:
1 -> [set1, set2, set3, ..., set1000000]
3 -> [set1, set3]
5 -> [set2]
7 -> [set1, set1000000]
8 -> [set2]
...
EDIT:
In inverted indexes in IR, to save space we sometimes use d-gaps, meaning we store the offset between documents rather than the actual number. For example, [2,5,10] becomes [2,3,5]. Doing so and using delta encoding to represent the numbers tends to help a lot when it comes to space.
(Of course there is also a downside: you need to read the entire list in order to find whether a specific set/document is in it, and you cannot use binary search, but it is sometimes worth it, especially if it makes the difference between fitting the index into RAM or not.)
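A minimal sketch of this inverted-index lookup (illustrative names; the posting lists are intersected smallest-first to keep the intermediate result small):

from collections import defaultdict

sets = {
    "Set1": {1, 3, 7},
    "Set2": {1, 5, 8, 10},
    "Set3": {1, 3, 11, 14, 15},
    "Set1000000": {1, 7, 10, 19},
}

index = defaultdict(set)         # element -> names of the sets containing it
for name, s in sets.items():
    for x in s:
        index[x].add(name)

def supersets_of(input_set):
    postings = sorted((index.get(x, set()) for x in input_set), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for p in postings[1:]:
        result &= p              # the AND query over the inverted index
    return result

print(sorted(supersets_of({1, 7})))   # ['Set1', 'Set1000000']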
How about storing a list of the sets which contain each number?
1 -- 1, 2, 3, 1000000
3 -- 1, 3
5 -- 2
etc.
Extending amit's solution, instead of storing the actual numbers, you could just store intervals and their associated sets.
For example, using an interval size of 5:
(1-5): [1,2,3,1000000]
(6-10): [1,2,1000000]
(11-15): [3]
(16-20): [1000000]
In the case of (1,7), you should consider intervals (1-5) and (6-10) (which can be determined simply by knowing the interval size). Intersecting those ranges gives you [1,2,1000000]. A binary search within each of those sets then shows that (1,7) actually exists only in set1 and set1000000; set2 contains 1, but nothing equal to 7 in the 6-10 range.
Though you'll want to check the min and max values for each set to get a better idea of what the interval size should be. For example, 5 is probably a bad choice if the min and max values go from 1 to a million.
You should probably keep it so that a binary search can be used to check for values, so the interval size should be something like (min + max)/N, where 2N is the maximum number of values that will need to be binary-searched in each set. For example, "does set 3 contain any values from 5 to 10?" is answered by finding the values closest to 5 (which is 3) and to 10 (which is 11); in this case, no, it does not. You would have to go through each set and do binary searches for the interval values that could be within the set, making sure that you don't go searching for 100 when the set only goes up to 10.
You could also just store the range (min and max) of each set. However, the issue is that I suspect your numbers are going to be clustered, so this would not provide much use on its own. Although, as mentioned, it will probably be useful for determining how to set up the intervals.
It'll still be troublesome to pick what range to use: too large, and it'll take a long time to build the data structure (1000 * million * log(N)); too small, and you'll start to run into space issues. The ideal range size is probably one that ensures the number of sets related to each range is approximately equal, while also ensuring that the total number of ranges isn't too high.
Edit:
One benefit is that you don't actually need to store all intervals, just the ones you need. Although, if you have too many unused intervals, it might be wise to adjust the interval size and re-split the current intervals, to ensure that the search stays fast. This is especially true if preprocessing time isn't a major issue.
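A rough sketch of the interval idea with an interval size of 5 (illustrative names; the bucket intersection yields candidates, and a per-set membership test then verifies them):

from collections import defaultdict

sets = {
    1: {1, 3, 7},
    2: {1, 5, 8, 10},
    3: {1, 3, 11, 14, 15},
    1000000: {1, 7, 10, 19},
}
SIZE = 5

buckets = defaultdict(set)          # interval -> sets with a value in it
for sid, values in sets.items():
    for v in values:
        buckets[(v - 1) // SIZE].add(sid)   # 1-5 -> 0, 6-10 -> 1, ...

def supersets(input_set):
    candidates = None
    for v in input_set:
        b = buckets.get((v - 1) // SIZE, set())
        candidates = set(b) if candidates is None else candidates & b
    # Exact verification of each surviving candidate.
    return [sid for sid in (candidates or set())
            if input_set <= sets[sid]]

print(sorted(supersets({1, 7})))   # [1, 1000000]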
Start searching from the biggest number (7) of the input set, eliminating the sets that don't contain it (Set1 and Set1000000 will be returned).
Then search for the other input elements (1) in the remaining sets.