Find the lexicographically largest unique string - algorithm

I need an algorithm to find the largest unique (no duplicate characters) string that can be obtained from a given string by removing characters (no rearranging), i.e. a subsequence of it.
String A is greater than String B if it satisfies these two conditions.
1. Has more characters than String B
Or
2. Is lexicographically greater than String B if equal length
For example, if the input string is dedede, then the possible unique combinations are de, ed, d, and e.
Of these combinations, the largest one is therefore ed since it has more characters than d and e and is lexicographically greater than de.
The algorithm must be more efficient than generating all possible unique strings and sorting them to find the largest one.
Note: this is not a homework assignment.

How about this
string getLargest(string s)
{
    string result = "";
    for (size_t i = 0; i < s.length();)
    {
        // Find the first occurrence of the largest character in s[i..].
        size_t largest_char_pos = i;
        for (size_t j = i + 1; j < s.length(); j++)
        {
            if (s[largest_char_pos] < s[j]) largest_char_pos = j;
        }
        result += s[largest_char_pos];
        i = largest_char_pos + 1;
    }
    return result;
}
This code snippet just gives you the lexicographically largest string. If you don't want duplicates, you can keep track of the characters already added.

Let me state the rules for ordering in a way that I think is more clear.
String A is greater than string B if
- A is longer than B
OR
- A and B are the same length and A is lexicographically greater than B
If my restatement of the rules is correct then I believe I have a solution that runs in O(n^2) time and O(n) space. My solution is a greedy algorithm based on the observation that the longest valid subsequence has exactly as many characters as there are unique characters in the input string. I wrote this in Go, and hopefully the comments are sufficient to describe the algorithm.
func findIt(str string) string {
    // exc keeps track of characters that we cannot use because they have
    // already been used in an earlier part of the subsequence
    exc := make(map[byte]bool)
    // ret is where we will store the characters of the final solution as we
    // find them
    var ret []byte
    for len(str) > 0 {
        // inc keeps track of unique characters as we scan from right to left so
        // that we don't take a character until we know that we can still make the
        // longest possible subsequence.
        inc := make(map[byte]bool, len(str))
        // best is the largest character we have found that can also get us the
        // longest possible subsequence.
        var best byte
        // best_pos is the lowest index that we were able to find best at, we
        // always want the lowest index so that we keep as many options open to us
        // later if we take this character.
        best_pos := -1
        // Scan through the input string from right to left
        for i := len(str) - 1; i >= 0; i-- {
            // Ignore characters we've already used
            if _, ok := exc[str[i]]; ok {
                continue
            }
            if _, ok := inc[str[i]]; !ok {
                // If we haven't seen this character already then it means that we can
                // make a longer subsequence by including it, so it must be our best
                // option so far
                inc[str[i]] = true
                best = str[i]
                best_pos = i
            } else {
                // If we've already seen this character it might still be our best
                // option if it is lexicographically larger than or equal to our
                // current best. If it is equal we want it because it is at a lower
                // index, which keeps more options open in the future.
                if str[i] >= best {
                    best = str[i]
                    best_pos = i
                }
            }
        }
        if best_pos == -1 {
            // If we didn't find any valid characters on this pass then we are done
            break
        } else {
            // include our best character in our solution, and exclude it from
            // consideration in any future passes.
            ret = append(ret, best)
            exc[best] = true
            // run the same algorithm again on the substring that is to the right of
            // best_pos
            str = str[best_pos+1:]
        }
    }
    return string(ret)
}
I am fairly certain you can do this in O(n) time, but I wasn't sure of my solution so I posted this one instead.
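For what it's worth, here is one way a linear-time version might look. This is not the answer author's unposted solution; it is a sketch of the well-known monotonic-stack technique, written in Python, and the function name `largest_unique_subsequence` is made up for illustration. It relies on the same observation as above: the result must contain every distinct character exactly once, so a smaller character already on the stack may be dropped only if it occurs again later.

```python
def largest_unique_subsequence(s):
    """Longest duplicate-free subsequence of s, lexicographically largest among ties."""
    last = {ch: i for i, ch in enumerate(s)}  # last index of each character
    stack = []       # characters chosen so far, in order
    used = set()     # characters currently on the stack
    for i, ch in enumerate(s):
        if ch in used:
            continue  # already placed; a second copy is never allowed
        # Drop smaller characters from the end while they can still be
        # re-added later (they occur again after position i).
        while stack and stack[-1] < ch and last[stack[-1]] > i:
            used.discard(stack.pop())
        stack.append(ch)
        used.add(ch)
    return "".join(stack)


print(largest_unique_subsequence("dedede"))  # ed
```

Each character is pushed and popped at most once per occurrence, so the scan is O(n) with dictionary and set lookups treated as constant time.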

Related

Find correct bracket sequence, build from two parts of original bracket sequence

Consider bracket sequences that consist only of '(' and ')'. Let S be any bracket sequence (not necessarily correct) with n items: S[1:n].
I need to write an algorithm, that will find such a number i (from 1 to n, if there is such a number), that S[(i+1):n]+S[1:i] is a correct bracket sequence. I also need this algorithm to have O(n) operations.
It seems to me that I should use a deque for this algorithm: pop the last element and push it onto the front until a correct bracket sequence appears. But I can't find an efficient way to check whether the new sequence is correct. If I use a counter that increases each time '(' appears and decreases otherwise (note that a correct sequence must start with '('), then checking the sequence takes n operations after each move of the last element to the front, so the algorithm as a whole takes O(n^2) operations, but I need linear time.
Should I really use deque or is there any other way to check the correctness of the sequence in the deque?
You can do this with a single scan from left to right:
Keep track of the nesting depth of the parentheses. So for example, after "((" it is 2, and after "(()))" it is -1.
Keep track of the position at which this depth hit its minimum value during the scan. This will be the potential splitting point (after the bracket that caused the minimum depth).
At the end of the scan verify that the depth has reached 0. If not, the input is not valid, otherwise return the splitting point.
By consequence, every input that has an equal number of opening and closing parentheses will have a solution.
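The claim above can be checked mechanically. The following is a small Python sketch of the same scan plus a balance checker; the names `find_split` and `is_balanced` are chosen here for illustration, and the returned value is the number of leading characters that get moved to the end (the i of the question, counting from 1).

```python
def find_split(s):
    """Return how many leading characters to move to the end, or -1 if impossible."""
    depth = min_depth = start = 0
    for i, ch in enumerate(s):
        depth += 1 if ch == "(" else -1
        if depth < min_depth:      # new minimum: split right after this bracket
            min_depth = depth
            start = i + 1
    return start if depth == 0 else -1


def is_balanced(s):
    """Standard check: depth never goes negative and ends at zero."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0


s = "())()("
k = find_split(s)
print(k, s[k:] + s[:k], is_balanced(s[k:] + s[:k]))
```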
Here is an implementation of this algorithm in an interactive JavaScript snippet. As you enter the input, the output is updated. It displays the rearranged parts of the input, separated by "...":
function findSplit(s) {
    let start = 0;
    let depth = 0;
    let minDepth = 0;
    for (let i = 0; i < s.length; i++) {
        if (s[i] == ")") {
            depth--;
            if (depth < minDepth) {
                minDepth = depth;
                // restart
                start = i+1;
            }
        } else {
            depth++;
        }
    }
    if (depth != 0) return -1; // not valid
    return start;
}
// I/O handling
let input = document.querySelector("input");
let output = document.querySelector("pre");
function refresh() {
let s = input.value;
let start = findSplit(s);
if (start == -1) output.textContent = "Invalid";
else output.textContent = s.slice(start) + "..." + s.slice(0, start);
}
input.oninput = refresh;
refresh();
<input value="())()(">
<pre></pre>

Reconstructing a string of words using a dictionary into an English sentence

I am completely stumped. The question is: given a string like "thisisasentence" and a function isWord() that returns true if its argument is an English word, split the string into a sentence. Taking the first word that matches each time, I would get stuck on "this is a sent".
How can I recursively return and keep track of where I am each time?
You need backtracking, which is easily achievable using recursion. The key observation is that you do not need to keep track of where you are past the moment when you are ready to return a solution.
You have a valid "split" when one of the following is true:
The string w is empty (base case), or
You can split non-empty w into substrings p and s, such that p+s=w, p is a word, and s can be split into a sentence (recursive call).
An implementation can return a list of words when a successful split is found, or null when none can be found. The base case always returns an empty list; the recursive case, upon finding a p, s split that results in a non-null return for s, constructs a list with p prefixed to the list returned from the recursive call.
The recursive case will have a loop in it, trying all possible prefixes of w. To speed things up a bit, the loop could terminate upon reaching the prefix that is equal in length to the longest word in the dictionary. For example, if the longest word has 12 characters, you know that trying prefixes 13 characters or longer will not result in a match, so you could cut enumeration short.
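The description above might be sketched as follows in Python. The function name `split_into_words`, the `max_len` parameter (the optional longest-word cutoff), and the tiny `WORDS` set standing in for a real dictionary are all assumptions made for this illustration; `is_word` plays the role of isWord().

```python
def split_into_words(w, is_word, max_len=None):
    """Return a list of words whose concatenation is w, or None if there is none."""
    if w == "":
        return []                       # base case: nothing left to split
    limit = len(w) if max_len is None else min(len(w), max_len)
    for end in range(1, limit + 1):     # try every prefix p = w[:end]
        p, s = w[:end], w[end:]
        if is_word(p):
            rest = split_into_words(s, is_word, max_len)
            if rest is not None:        # s can be split, so p followed by rest works
                return [p] + rest
    return None                         # every prefix failed: backtrack


WORDS = {"this", "is", "a", "sent", "sentence"}   # toy dictionary
print(split_into_words("thisisasentence", WORDS.__contains__))
# ['this', 'is', 'a', 'sentence']
```

Returning None from a failed branch is what does the "keeping track": the caller simply moves on to the next, longer prefix. Without memoization the worst case is exponential, which is usually acceptable for short inputs.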
Just adding to the answer above.
According to my experience, many people understand recursion better when they see a «linearized» version of a recursive algorithm, which means «implemented as a loop over a stack». Linearization is applicable to any recursive task.
Assuming that isWord() has two parameters (1st: string to test; 2nd: its length) and returns a boolean-compatible value, a C implementation of backtracking is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed to be provided elsewhere, with the signature described above. */
int isWord(const char *word, int length);

void doSmth(char *phrase, int *words, int total) {
    int i;
    for (i = 0; i < total; ++i)
        printf("%.*s ", words[i + 1] - words[i], phrase + words[i]);
    printf("\n");
}
void parse(char *phrase) {
    int current, length, *words;
    if (phrase) {
        words = (int*)calloc((length = strlen(phrase)) + 2, sizeof(int));
        current = 1;
        while (current) {
            for (++words[current]; words[current] <= length; ++words[current])
                if (isWord(phrase + words[current - 1],
                           words[current] - words[current - 1])) {
                    words[current + 1] = words[current];
                    current++;
                }
            if (words[--current] == length)
                doSmth(phrase, words, current); /** parse successful! **/
        }
        free(words);
    }
}
As can be seen, for each word, a pair of stack values is used: the first is an offset to the current word's first character, and the second is a potential offset of the character exactly after the current word's last one (thus being the next word's first character). The second value of the current word (the one whose pair is at the top of our «stack») is iterated through all characters left in the phrase.
When a word is accepted, a new second value (equal to the current one, so that only positions after it are considered) is pushed to the stack, making the former second value the first in a new pair. If the current word (the one just found) completes the phrase, something useful is performed; see doSmth().
If there are no more acceptable words in the remaining part of our phrase, the current word is considered unsuitable, and its second value is discarded from the stack, effectively repeating the search for words at the previous starting location, while the ending location is now farther than that of the word previously accepted there.

Finding the longest sub-string with no repetition in a string. Time Complexity?

I recently interviewed with a company for a software engineering position. I was asked to find the longest substring with no repeated characters in a string. My algorithm was as follows:
Start from the left-most character, and keep storing each character in a hash table with the character as the key and index_where_it_last_occurred as the value. Add the character to the answer string as long as it is not present in the hash table. If I encounter a stored character again, I stop and note down the length. I empty the hash table and then start again from the index to the right of the repeated character's stored position. That position is retrieved from the index_where_it_last_occurred value. If I reach the end of the string, I stop and return the longest length.
For example, say the string was, abcdecfg.
I start with a, store in hash table. I store b and so on till e. Their indexes are stored as well. When I encounter c again, I stop since it's already hashed and note down the length which is 5. I empty the hash table, and start again from the index to the right of the repeated character. The repeated character being c, I start again from position 3, i.e. the character d. I keep doing this until I reach the end of the string.
I am interested in knowing what the time complexity of this algorithm will be. IMO, it'll be O(n^2).
This is the code.
import java.util.*;
public class longest
{
    static int longest_length = -1;
    public static void main(String[] args)
    {
        Scanner in = new Scanner(System.in);
        String str = in.nextLine();
        calc(str,0);
        System.out.println(longest_length);
    }
    public static void calc(String str,int index)
    {
        if(index >= str.length()) return;
        int temp_length = 0;
        LinkedHashMap<Character,Integer> map = new LinkedHashMap<Character,Integer>();
        for (int i = index; i<str.length();i++)
        {
            if(!map.containsKey(str.charAt(i)))
            {
                map.put(str.charAt(i),i);
                ++temp_length;
            }
            else
            {
                if(longest_length < temp_length)
                {
                    longest_length = temp_length;
                }
                int last_index = map.get(str.charAt(i));
                // System.out.println(last_index);
                calc(str,last_index+1);
                break;
            }
        }
        if(longest_length < temp_length)
            longest_length = temp_length;
    }
}
If the alphabet is of size K, then when you restart counting you jump back at most K-1 places, so you read each character of the string at most K times. So the algorithm is O(nK).
The input string which contains n/K copies of the alphabet exhibits this worst-case behavior. For example if the alphabet is {a, b, c}, strings of the form "abcabcabc...abc" have the property that nearly every character is read 3 times by your algorithm.
You can solve the original problem in O(K+n) time, using O(K) storage space by using dynamic programming.
Let the string be s. We keep a number M, the length of the longest unique-character substring ending at i; an array P, which stores where each character was last seen; and best, the length of the longest unique-character substring found so far.
Start:
Set P[c] = -1 for each c in the alphabet.
M = 0
best = 0
Then, for each i:
M = min(M+1, i-P[s[i]])
best = max(best, M)
P[s[i]] = i
This is trivially O(K) in storage, and O(K+n) in running time.
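As a concrete rendering of the recurrence, here is a short Python sketch. It substitutes a dictionary for the K-sized array P (a missing key plays the role of -1), which is an implementation choice made here, not part of the answer; the function name is likewise made up.

```python
def longest_unique_substring_length(s):
    """Length of the longest substring of s with no repeated character."""
    last_seen = {}   # P: index at which each character was last seen
    m = 0            # M: longest duplicate-free substring ending at i
    best = 0
    for i, ch in enumerate(s):
        # Either extend the previous window by one, or cut it just after
        # the previous occurrence of ch, whichever is shorter.
        m = min(m + 1, i - last_seen.get(ch, -1))
        best = max(best, m)
        last_seen[ch] = i
    return best


print(longest_unique_substring_length("abcdecfg"))  # 5
```

On "abcdecfg" the value of M runs 1, 2, 3, 4, 5, 3, 4, 5: the second c cuts the window back to "dec", and it then grows again to "decfg".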

Algorithm for finding first repeated substring of length k

There is a homework I should do and I need help. I should write a program to find the first substring of length k that is repeated in the string at least twice.
For example in the string "banana" there are two repeated substrings of length 2: "an" , "na". In this case, the answer is "an" because it appeared sooner than "na"
Note that the simple O(n^2) algorithm is not useful since there is time limit on execution time of program so I guess it should be in linear time.
There is a hint too: Use Hash table.
I don't want the code. I just want you to give me a clue because I have no idea how to do this using a hash table. Should I use a specific data structure too?
Iterate over the character indexes of the string (0, 1, 2, ...) up to and including the index of the second-from-last character (i.e. up to strlen(str) - 2). For each iteration, do the following...
Extract the 2-char substring starting at the character index.
Check whether your hashtable contains the 2-char substring. If it does, you've got your answer.
Insert each 2-char substring into the hashtable.
This is easily modifiable to cope with substrings of length k.
Combine Will A's algorithm with a rolling hash to get a linear-time algorithm.
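A rough Python sketch of that combination follows. Several things here are assumptions rather than anything stated in the thread: the function name, the hash base and modulus, and the reading of "first" as "the repeated substring whose first occurrence is earliest". Hash matches are confirmed by comparing the actual substrings, since different windows can collide.

```python
def first_repeated_substring(s, k):
    """Return the repeated length-k substring that starts earliest, or None."""
    n = len(s)
    if k <= 0 or k > n:
        return None
    base, mod = 257, (1 << 61) - 1
    top = pow(base, k - 1, mod)     # weight of the character leaving the window
    h = 0
    for ch in s[:k]:                # hash of the first window
        h = (h * base + ord(ch)) % mod
    seen = {}                       # hash -> start indices of distinct windows
    best = None                     # earliest start of a window seen twice
    for i in range(n - k + 1):
        if i > 0:                   # roll: drop s[i-1], append s[i+k-1]
            h = ((h - ord(s[i - 1]) * top) * base + ord(s[i + k - 1])) % mod
        for j in seen.get(h, []):
            if s[j:j + k] == s[i:i + k]:    # confirm the match
                if best is None or j < best:
                    best = j
                break
        else:
            seen.setdefault(h, []).append(i)
    return None if best is None else s[best:best + k]


print(first_repeated_substring("banana", 2))  # an
```

Updating the hash is O(1) per window, but each confirmed match still costs O(k) for the comparison, so the running time is linear only when matches are rare or when the confirmation step is skipped and a small collision risk is accepted.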
You can use a LinkedHashMap, which iterates in insertion order.
public static String findRepeated(String s, int k) {
    Map<String, Integer> map = new LinkedHashMap<String, Integer>();
    // <= so that the last window, starting at s.length() - k, is counted too
    for (int i = 0; i <= s.length() - k; i++) {
        String temp = s.substring(i, i + k);
        if (!map.containsKey(temp)) {
            map.put(temp, 1);
        }
        else {
            map.put(temp, map.get(temp) + 1);
        }
    }
    // Entries come back in order of first occurrence
    for (Map.Entry<String, Integer> entry : map.entrySet()) {
        if (entry.getValue() > 1) {
            return entry.getKey();
        }
    }
    return "no such value";
}

C/C++/Java/C#: help parsing numbers

I've got a real problem (it's not homework, you can check my profile). I need to parse data whose formatting is not under my control.
The data look like this:
6,852:6,100,752
So there's first a number made of up to 9 digits, followed by a colon.
Then I know for sure that, after the colon:
there's at least one valid combination of numbers that add up to the number before the colon
I know exactly how many numbers add up to the number before the colon (two in this case, but it can go as high as ten numbers)
In this case, 6852 is 6100 + 752.
My problem: I need to find these numbers (in this example, 6100 + 752).
It is unfortunate that in the data I'm forced to parse, the separator between the numbers (the comma) is also the separator used inside the numbers themselves (6100 is written as 6,100).
Once again: that unfortunate formatting is not under my control and, once again, this is not homework.
I need to solve this for up to 10 numbers that need to add up.
Here's an example with three numbers adding up to 6855:
6,855:360,6,175,320
I fear that there are cases where there would be two possible different solutions. HOWEVER if I get a solution that works "in most cases" it would be enough.
How do you typically solve such a problem in a C-style bracket language?
Well, I would start with the brute force approach and then apply some heuristics to prune the search space. Just split the list on the right by commas and iterate over all possible ways to group them into n terms (where n is the number of terms in the solution). You can use the following two rules to skip over invalid possibilities.
(1) You know that any group of 1 or 2 digits must begin a term.
(2) You know that no candidate term in your comma delimited list can be greater than the total on the left. (This also tells you the maximum number of digit groups that any candidate term can have.)
Recursive implementation (pseudo code):
int total; // The total read before the colon
// Takes the list of tokens as integers after the colon
// tokens is the list of tokens left to analyse,
// partialList is the partial list of numbers built so far
// sum is the sum of numbers in partialList
// Aggregate takes 2 ints XXX and YYY and returns XXX,YYY (= XXX*1000+YYY)
function getNumbers(tokens, sum, partialList) =
    if isEmpty(tokens)
        if sum = total return partialList
        else return null // Got to the end with the wrong sum
    // Option 1: tokens[0] is a number on its own
    var result1 = getNumbers(tokens[1:end], sum+tokens[0], Append(partialList, tokens[0]))
    if result1 <> null return result1
    // Option 2: tokens[0] and tokens[1] form one number (needs at least two tokens)
    if length(tokens) < 2 return null
    var result2 = getNumbers(tokens[2:end], sum+Aggregate(tokens[0], tokens[1]), Append(partialList, Aggregate(tokens[0], tokens[1])))
    if result2 <> null return result2
    return null // No solution overall
You can do a lot better from different points of view, like tail recursion, pruning (you can have XXX,YYY only if YYY has 3 digits)... but this may work well enough for your app.
Divide-and-conquer would make for a nice improvement.
I think you should try all possible ways to parse the string and calculate the sum and return a list of those results that give the correct sum. This should be only one result in most cases unless you are very unlucky.
One thing to note that reduces the number of possibilities is that there is only an ambiguity if you have aa,bbb and bbb is exactly 3 digits. If you have aa,bb there is only one way to parse it.
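To make the "try every grouping" idea concrete, here is a hedged Python sketch. The function name `parse_line` and its `count` parameter (the known number of summands) are invented for this example. It keeps the comma-separated groups as strings so that the three-digit rule can be checked, and it returns the first grouping that works, not all of them.

```python
def parse_line(line, count):
    """Split 'TOTAL:g1,g2,...' into `count` numbers that sum to TOTAL, or None."""
    left, right = line.split(":")
    total = int(left.replace(",", ""))
    groups = right.split(",")           # digit groups, kept as strings

    def search(pos, remaining, target):
        # pos: next unused group; remaining: numbers still to build;
        # target: what those numbers must add up to.
        if remaining == 0:
            return [] if pos == len(groups) and target == 0 else None
        if pos == len(groups):
            return None
        end = pos + 1
        while True:
            value = int("".join(groups[pos:end]))
            if value > target:          # adding more groups only makes it larger
                return None
            rest = search(end, remaining - 1, target - value)
            if rest is not None:
                return [value] + rest
            # A group can continue the current number only if it has 3 digits.
            if end < len(groups) and len(groups[end]) == 3:
                end += 1
            else:
                return None

    return search(0, count, total)


print(parse_line("6,852:6,100,752", 2))        # [6100, 752]
print(parse_line("6,855:360,6,175,320", 3))    # [360, 6175, 320]
```

With at most ten numbers per line the search space stays small, and the two pruning rules (a term cannot exceed the remaining total; only three-digit groups can continue a number) cut it down further.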
Reading in C++:
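#include <iostream>
#include <sstream>
#include <utility>
#include <vector>

// Left as an exercise: picks the combination of groups that adds up to total.
std::vector<int> get_winning_combination( int total
                                        , std::vector<int>::const_iterator begin
                                        , std::vector<int>::const_iterator end );

// Reads "6,852:6,100,752" into the total before the colon (6852) and the
// comma-separated groups after it (6, 100, 752).
std::pair<int,std::vector<int> > read_numbers(std::istream& is)
{
    std::pair<int,std::vector<int> > result;
    bool before_colon = true;
    for(;;) {
        int i;
        if(!(is >> i)) throw "expected a number";
        if(before_colon) result.first = result.first*1000 + i;
        else result.second.push_back(i);
        char ch;
        if(!(is >> ch)) return result; // end of input
        if(ch == ':' && before_colon) before_colon = false;
        else if(ch != ',') throw "unexpected separator";
    }
}
void f()
{
    std::istringstream iss("6,852:6,100,752");
    std::pair<int,std::vector<int> > foo = read_numbers(iss);
    std::vector<int> result = get_winning_combination( foo.first
                                                     , foo.second.begin()
                                                     , foo.second.end() );
    for( std::vector<int>::const_iterator i=result.begin(); i!=result.end(); ++i)
        std::cout << *i << " ";
}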
The actual cracking of the numbers is left as an exercise to the reader. :)
I think your main problem is deciding how to actually parse the numbers. The rest is just rote work with strings->numbers and iteration over combinations.
For instance, in the examples you gave, you could heuristically decide that a single-digit number followed by a three-digit number is, in fact, a four-digit number. Does a heuristic such as this hold true over a larger dataset? If not, you're also likely to have to iterate over the possible input parsing combinations, which means the naive solution is going to have a high polynomial complexity (O(n^x), where x > 4).
Actually checking for which numbers add up is easy to do using a recursive search.
// Requires: using System.Collections.Generic; using System.Linq;
List<int> GetSummands(int total, int numberOfElements, List<int> values)
{
    if (numberOfElements == 0)
    {
        if (total == 0)
            return new List<int>(); // Empty list.
        else
            return null; // Indicate no solution.
    }
    else if (total < 0)
    {
        return null; // Indicate no solution.
    }
    else
    {
        for (int i = 0; i < values.Count; ++i)
        {
            List<int> summands = GetSummands(
                total - values[i], numberOfElements - 1, values.Skip(i + 1).ToList());
            if (summands != null)
            {
                // Found solution; insert at the front to keep the input order.
                summands.Insert(0, values[i]);
                return summands;
            }
        }
        return null; // No remaining value leads to a solution.
    }
}

Resources