How to get the smallest in lexicographical order? - algorithm

I am doing a leetcode exercise
https://leetcode.com/problems/remove-duplicate-letters/
The question is:
# Given a string which contains only lowercase letters, remove duplicate
# letters so that every letter appears once and only once. You must make
# sure your result is the smallest in lexicographical order among all possible results.
#
# Example:
# Given "bcabc"
# Return "abc"
#
# Given "cbacdcbc"
# Return "acdb"
I am not quite sure what "the smallest in lexicographical order" means, or why, given "cbacdcbc", the answer would be "acdb".
Thanks for the answer in advance :)

Lexicographical order is an order relation where string s is smaller than t if the first character of s (s1) is smaller than the first character of t (t1), or, in case they are equal, the second character is smaller, and so on.
So aaabbb is smaller than aaac because, although the first three characters are equal, the fourth character b is smaller than the fourth character c.
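In Python, for example, the built-in string comparison is exactly this lexicographical order, so small cases can be checked directly:
print("aaabbb" < "aaac")                  # True: 'b' < 'c' at the first differing position
print(min(["bca", "bac", "cab", "abc"]))  # abc, the smallest of the candidates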
For cbacdcbc there are several options: since b and c are duplicated, you can decide which occurrences to remove. This results in:
cbacdcbc = adbc
cbacdcbc = adcb
cbacdcbc = badc
cbacdcbc = badc
...
Since adbc < adcb, you thus cannot simply answer with the first result that pops into your mind.

You cannot reorder characters. You can only choose which occurrence to remove in case of duplicated characters.
bcabc
We can remove either the first b or the second b, and either the first c or the second c. Altogether, four outputs:
..abc
.cab.
b.a.c
bca..
Sort these four outputs lexicographically (alphabetically):
abc
bac
bca
cab
And take the first one:
abc
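For small inputs you can make this enumeration literal. The following brute-force sketch (my own illustration, not part of the answer) tries every way of keeping exactly one occurrence of each letter and returns the smallest surviving string:
from itertools import combinations

def smallest_removal(s):
    # keep exactly one occurrence of each distinct letter, preserving order,
    # and return the lexicographically smallest surviving string
    letters = set(s)
    best = None
    for keep in combinations(range(len(s)), len(letters)):
        candidate = "".join(s[i] for i in keep)
        if len(set(candidate)) == len(letters):   # every letter survives exactly once
            if best is None or candidate < best:
                best = candidate
    return best

print(smallest_removal("bcabc"))     # abc
print(smallest_removal("cbacdcbc"))  # acdb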

Clearly, the wanted output must contain each letter only once.
From what I understand, you must pick the letters so that the result is as small as possible, with letters that come earlier in the alphabet (which is also ASCII order for lowercase letters) placed as far to the left as you can manage.
Now you might ask why the answer is "acdb" and not "abcd". You don't take the first "c" and "b" because more c's and b's appear later, but you have to take the "a" since it occurs only once. After that, the only remaining b comes after the last d, so you cannot place the b before the d; you take the c that comes right before the d (giving "acd...") rather than waiting for a later c (which would give "adc..."), and finally the b.
In short, you want the best lexicographical order from low to high, but you must make sure you can still take every remaining letter while iterating over the input string.
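For completeness, the usual way to implement this idea is a single greedy pass with a stack: append each new letter, but first pop any larger letter that still occurs later in the string. This is the standard solution to the LeetCode problem rather than anything spelled out in the answer above:
def remove_duplicate_letters(s):
    last = {c: i for i, c in enumerate(s)}   # last index where each letter occurs
    stack, seen = [], set()
    for i, c in enumerate(s):
        if c in seen:
            continue
        # pop letters that are larger than c and will appear again later
        while stack and stack[-1] > c and last[stack[-1]] > i:
            seen.remove(stack.pop())
        stack.append(c)
        seen.add(c)
    return "".join(stack)

print(remove_duplicate_letters("bcabc"))     # abc
print(remove_duplicate_letters("cbacdcbc"))  # acdb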

String comparison can usually be done in two ways:
Compare at the first unmatched letter (called lexicographical); for example, aacccccc is less than ab because at the second position the first string has a and the second has b (and a < b).
Compare string lengths first, treating the shorter string as less. If the lengths are equal, then apply the lexicographical comparison.
The second one may be faster if the lengths of the strings are known.
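A quick Python illustration of the difference between the two; the shortlex helper is just for demonstration:
def shortlex_less(a, b):
    # length first, then lexicographical (the second method above)
    return (len(a), a) < (len(b), b)

print("aacccccc" < "ab")                # True: plain lexicographical comparison
print(shortlex_less("aacccccc", "ab"))  # False: the longer string is treated as greater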
Your question contains a small error:
why Given "bcabc" then the answer would be "acdb"
While the original was: "Given "bcabc" Return "abc"". It makes sense that abc should be returned instead of bca.

There seems to be some misunderstanding; the example states that for the input bcabc, the expected output should be abc, not acdb, which refers to the input cbacdcbc.

The smallest in lexicographical order: your answer should be a subsequence of the initial string, containing one instance of every character.
If there are many such subsequences possible (bca, bac, cab, abc for the first example), return the smallest one, comparing them as strings (think of the order of words in a dictionary).
why Given "bcabc" then the answer would be "acdb"
You confused two different examples

Related

An idea for a data structure that will store sentences that differ in one word

I have a file consisting of a lot of lines like:
John is running at night
John is not walking at night
Jack is running at night
Jack is waiting for someone
John is waiting for someone
and I need to write a program that will group similar sentences and print them to a file.
Similar sentences are sentences that differ by only a single word.
For example, the output file should look like:
John is running at night
Jack is running at night
The changing word was: John, Jack
Jack is waiting for someone
John is waiting for someone
The changing word was: John, Jack
I thought to implement it by parsing the file and arranging the strings into groups by the number of words in each string (all strings with 6 words grouped together, all strings with 5 words grouped together, and so on).
After arranging the groups I can split each string into a set of words and compare each string with the others to check for a match.
I think my solution is not efficient.
Can anyone think of a better solution?
Let us assume there are M sentences with an average of N words each. For every sentence we wish to produce a list of indices of other sentences (up to M - 1) that differ by exactly one word. Thus, the input size is O(MN) words and the output size is O(M²) numbers. Here is an algorithm that runs in O(MN + M²) and is therefore optimal.
First, read all the sentences, split them into words and index the words in a hash table. Thus, we can think of sentences as arrays of word indices. To help our thought process, we can further think of sentences as Latin lowercase strings by replacing each initial word with a letter (this works for up to 26 distinct words).
Now we wish to be able to query each pair of strings (A, B) in O(1) and ask "do A and B differ by exactly one letter?" To answer,
let l be the common length of A and B;
let p be the length of the common prefix between A and B;
let s be the length of the common suffix between A and B;
then notice that A and B differ by exactly one letter if l = p + s + 1.
Therefore, our algorithm boils down to determining, in constant time, the length of the common prefix and common suffix for every pair of strings. We show how to do this for prefixes. The same approach works for suffixes, e.g. by reversing the strings.
First, sort the strings and measure the common prefixes between each consecutive pair. For example:
banana
> common prefix 3 ("ban")
band
> common prefix 4
bandit
> common prefix 1
brother
> common prefix 7
brotherly
> common prefix 0
car
Now, suppose you want to query the common prefix between "band" and "brotherly". This will be the minimum numeric value between "band" and "brotherly", or min(4, 1, 7) = 1. This can be achieved with range minimum queries in O(M) preprocessing time and O(1) per query, although simpler implementations are available with O(M log M) preprocessing time.
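If you only need to test the l = p + s + 1 criterion pair by pair (without the sorting and range-minimum machinery described above for O(1) queries), a direct check could look like this; the function name is mine:
def differ_by_one_word(a, b):
    # a and b are sentences given as lists of words
    if len(a) != len(b):
        return False
    l = len(a)
    p = 0                                      # length of the common prefix
    while p < l and a[p] == b[p]:
        p += 1
    s = 0                                      # length of the common suffix (non-overlapping)
    while s < l - p and a[l - 1 - s] == b[l - 1 - s]:
        s += 1
    return l == p + s + 1

s1 = "John is running at night".split()
s2 = "Jack is running at night".split()
print(differ_by_one_word(s1, s2))  # True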

scrabble solving with maximum score

I was asked a question
You are given a list of characters, a score associated with each character, and a dictionary of valid words (say, a normal English dictionary). You have to form a word out of the character list such that the score is maximum and the word is valid.
I could think of a solution involving a trie made out of the dictionary and backtracking with the available characters, but could not formulate it properly. Does anyone know the correct approach, or can you come up with one?
First iterate over your letters and count how many times each character of the English alphabet occurs. Store this in a static array, say a char array of size 26, where the first cell corresponds to a, the second to b, and so on. Name this original array cnt. Now iterate over all words, and for each word form a similar array of size 26. For each cell in this array, check whether you have at least as many occurrences in cnt. If that is the case you can write the word, otherwise you can't. If you can write the word, compute its score and keep the maximum score seen so far in a helper variable.
This approach has linear complexity, which is also the best asymptotic complexity you can possibly have (after all, the input you're given is of linear size).
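A short Python sketch of this counting approach; the helper names and the toy letter scores are mine, just for illustration:
from collections import Counter

def best_word(tiles, dictionary, score):
    have = Counter(tiles)
    best, best_score = None, -1
    for word in dictionary:
        need = Counter(word)
        # the word is writable only if we have enough of every letter it needs
        if all(have[c] >= k for c, k in need.items()):
            s = sum(score[c] for c in word)
            if s > best_score:
                best, best_score = word, s
    return best, best_score

score = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}  # toy scores a=1 .. z=26
print(best_word("elbbbu", ["blue", "bubble", "bell"], score))  # ('bubble', 44)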
Inspired by Programmer Person's answer (initially I thought that approach was O(n!) so I discarded it). It needs O(number of words) setup and then O(2^(chars in query)) for each query. This is exponential, but in Scrabble you only have 7 letter tiles at a time, so you need to check only 128 possibilities!
The first observation is that the order of characters in the query or the word doesn't matter, so you want to process your list into a set of bags of chars. A way to do that is to 'sort' the word, so "bac" and "cab" both become "abc".
Now you take your query, and iterate all possible answers. All variants of keep/discard for each letter. It's easier to see in binary form: 1111 to keep all, 1110 to discard the last letter...
Then check if each possibility exists in your dictionary (hash map for simplicity), then return the one with the maximum score.
import nltk
from string import ascii_lowercase
from itertools import product

scores = {c: s for s, c in enumerate(ascii_lowercase)}
sanitize = lambda w: "".join(c for c in w.lower() if c in scores)
anagram = lambda w: "".join(sorted(w))
anagrams = {anagram(sanitize(w)): w for w in nltk.corpus.words.words()}

while True:
    query = input("What do you have?")
    if not query: break
    # make it look like our preprocessed word list
    query = anagram(sanitize(query))
    results = {}
    # all variants for our query
    for mask in product((True, False), repeat=len(query)):
        # get the variant given the mask
        masked = "".join(c for i, c in enumerate(query) if mask[i])
        # check if it's valid
        if masked in anagrams:
            # score it, also getting the word back would be nice
            results[sum(scores[c] for c in masked)] = anagrams[masked]
    print(*max(results.items()))
Build a lookup trie of just the sorted anagram of each word of the dictionary. This is a one-time cost.
By sorted anagram I mean: if the word is eat, you represent it as aet. If the word is tea, you also represent it as aet; bubble is represented as bbbelu, etc.
Since this is Scrabble, assuming you have 8 tiles (say you want to use one from the board), you will need to check at most 2^8 possibilities.
For any subset of the tiles from the set of 8, you sort the tiles and look them up in the anagram trie.
There are at most 2^8 such subsets, and this could potentially be optimized (in case of repeating tiles) by doing a more clever subset generation.
If this is a more general problem, where 2^{number of tiles} could be much higher than the total number of anagram-words in the dictionary, it might be better to use frequency counts as in Ivaylo's answer, and the lookups potentially can be optimized using multi-dimensional range queries. (In this case 26 dimensions!)
Sorry, this might not help you as-is (I presume you are trying to do some exercise and have constraints), but I hope this will help the future readers who don't have those constraints.
If the number of dictionary entries is relatively small (up to a few million) you can use brute force: for each word, create a 32-bit mask. Preprocess the data: set one bit if the letter a/b/c/.../z is used. For the six most common English letters, etaoin, set another bit if the letter is used twice.
Create a similar bitmap for the letters that you have. Then scan the dictionary for words where all bits that are needed for the word are set in the bitmap for the available letters. You have reduced the problem to words where you have all needed characters once, and the six most common characters twice if they are needed twice. You'll still have to check whether a word can actually be formed: if you have a word like "bubble", the first test only tells you that you have the letters b, u, l, e, but not necessarily 3 b's.
By also sorting the list of words by point values before doing the check, the first hit is the best one. This has another advantage: You can count the points that you have, and don't bother checking words with more points. For example, bubble has 12 points. If you have only 11 points, then there is no need to check this word at all (have a small table with the indexes of the first word with any given number of points).
To improve anagrams: In the table, only store different bitmasks with equal number of points (so we would have entries for bubble and blue because they have different point values, but not for team and mate). Then store all the possible words, possibly more than one, for each bit mask and check them all. This should reduce the number of bit masks to check.
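A rough Python sketch of such a bitmask prefilter; the helper names are mine, and as noted above a passing mask is only a necessary condition, so a full letter-count check must still follow:
COMMON = "etaoin"  # the six most common English letters get a second bit when doubled

def mask(word):
    m = 0
    for c in set(word):
        m |= 1 << (ord(c) - ord('a'))          # one bit per distinct letter
    for i, c in enumerate(COMMON):
        if word.count(c) >= 2:
            m |= 1 << (26 + i)                 # extra bit: common letter used twice
    return m

def may_fit(word, rack):
    # every bit the word needs must be available in the rack's bitmap
    return mask(word) & ~mask(rack) == 0

print(may_fit("team", "metaxy"))   # True: all needed letters are on the rack
print(may_fit("bubble", "bulex"))  # True, although the rack has only one b,
                                   # which is why a letter-count check is still needed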
Here is a brute force approach in Python, using an English dictionary containing 58,109 words. This approach is actually quite fast, timing at about 0.3 seconds on each run.
from random import shuffle
from string import ascii_lowercase
import time

def getValue(word):
    return sum(map(lambda x: key[x], word))

if __name__ == '__main__':
    v = range(26)
    shuffle(v)
    key = dict(zip(list(ascii_lowercase), v))
    with open("/Users/james_gaddis/PycharmProjects/Unpack Sentance/hard/words.txt", 'r') as f:
        wordDict = f.read().splitlines()
    f.close()
    valued = map(lambda x: (getValue(x), x), wordDict)
    print max(valued)
Here is the dictionary I used, with one hyphenated entry removed for convenience.
Can we assume that the dictionary is fixed, the scores are fixed, and only the available letters change (as in Scrabble)? Otherwise, I think there is nothing better than looking up each word of the dictionary as previously suggested.
So let's assume that we are in this setting. Pick an order < that respects the costs of letters. For instance Q > Z > J > X > K > ... > A > E > I > ... > U.
Replace your dictionary D with a dictionary D' made of the anagrams of the words of D with letters ordered by the previous order (so the word buzz is mapped to zzbu, for instance), and also removing duplicates and words of length > 8 if you have at most 8 letters in your game.
Then construct a trie with the words of D' where the child nodes are ordered by the value of their letters (so the first child of the root would be Q, the second Z, ..., the last one U). On each node of the trie, also store the maximal value of a word going through this node.
Given a set of available characters, you can explore the trie in a depth-first manner, going from left to right, and keeping in memory the current best value found. Only explore branches whose node value is larger than your current best value. This way, you will explore only a few branches after the first ones (for instance, if you have a Z in your game, any branch that starts with a one-point letter such as A is discarded, because it will score at most 8x1, which is less than the value of Z). I bet that you will explore only very few branches each time.
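A rough Python sketch of this trie plus branch-and-bound search; the toy letter scores, the node layout and all names are mine, and real Scrabble values would slot in the same way:
from collections import Counter

SCORES = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}  # toy scores

class Node:
    def __init__(self):
        self.kids = {}    # letter -> child Node
        self.best = 0     # maximal value of any word in this subtree
        self.word = None  # a word ending exactly at this node, if any
        self.value = 0

def build_trie(words):
    # keys are the word's letters sorted by descending score, as described above
    root = Node()
    for w in words:
        value = sum(SCORES[c] for c in w)
        node = root
        node.best = max(node.best, value)
        for c in sorted(w, key=lambda ch: -SCORES[ch]):
            node = node.kids.setdefault(c, Node())
            node.best = max(node.best, value)
        if value > node.value:
            node.value, node.word = value, w
    return root

def best_play(root, tiles):
    avail = Counter(tiles)
    best = [0, None]      # best value and word found so far
    def dfs(node):
        if node.word and node.value > best[0]:
            best[0], best[1] = node.value, node.word
        # children in descending letter score; prune subtrees that cannot win
        for c in sorted(node.kids, key=lambda ch: -SCORES[ch]):
            kid = node.kids[c]
            if avail[c] > 0 and kid.best > best[0]:
                avail[c] -= 1
                dfs(kid)
                avail[c] += 1
    dfs(root)
    return best[1], best[0]

trie = build_trie(["quiz", "zit", "it"])
print(best_play(trie, "iqtuz"))  # ('quiz', 73) with these toy scores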

Algorithm to find minimum length in main string to find second string [duplicate]

This question already has answers here:
Find length of smallest window that contains all the characters of a string in another string
(8 answers)
Closed 9 years ago.
There is an algorithm question.
The question is as follows:
You are given a protein string consisting of the characters A, B, C, D. You have to find a minimum length window in it that contains a given search string (the characters must appear in order, but not necessarily contiguously).
Example
0 1 2 3 4 5 6 7 8 9 10 11 12
A B A C D C A B C D C C D
String to find: BCD
This string is found between (StartPoint, EndPoint):
1, 4
7, 9
1, 12
7, 12
The minimum length window is (7, 9).
So the answer is 7, 9
My work,
We can solve this using a brute force approach in O(n^2).
We can find the first occurrence of B in the main string by using DP, and my DP logic is as follows:
A = Main string
B = String to be found
DP = Dynamic programming function
n = A.size, m = B.size
Build an array of DP[m+1][n+1]
DP[i][j] means whether B[0...j] is present in A[0...i] or not.
This way we can find our first occurrence of B in A. Now after this, I am stuck.
I need some hint from your side.
Please give me hint/guidance only, no code or implementation required.
Your sample problem and its solution clearly suggest that the solution will always be a numeric pair containing the position of the first letter of the substring and the position of the last letter of the substring, i.e.
If the substring is BCD, then the solution will be (position of B, position of D),
provided that the rest of the substring (C in this case) falls in between the solution pair.
So, to give a hint: start by finding the positions of the first letter of the substring in the main string and store them in an array. Similarly, find the positions of the last letter of the substring and store them in a second array. This gives a set of probable solution pairs, where each pair consists of one number from array 1 and one number from array 2 such that the number from array 2 is greater than the number from array 1. We might observe that there is no such pair, which means there is no solution, i.e. the substring does not exist in the main string; or we might find one or more such pairs, which means there can be a solution. All that is left to do is to check whether the rest of the substring exists between each solution pair. If more than one such pair survives at the end, the one with the smallest difference between the higher and the lower number is the right solution. Hope this helps; as you mentioned, you do not wish to know the entire answer, you are just looking for a hint :)
Based on the example, I'm assuming the search string needs to be found in the same order as given (i.e. ACB isn't a valid find for ABC).
General DP approach / hints:
The function we're trying to minimize is the distance so far, so this should be the value stored in each cell of your matrix.
For some position in the string and some position in the search string, we need to look back at all previous positions in the string for the previous position in the search string. For each of these we add the distance from that position to the current one and record the minimum.
To illustrate, assume a search string of A, B, C, D. Then for ABC in the search string and position i in the string, we need to look at positions 0 through i-1 for AB.
Given a string BACCD and a search string BCD, when looking at the last position of both, we'd have something like:
DP(BACCD, BCD) = min(4+DP(B, BC), 3+DP(BA, BC), 2+DP(BAC, BC), 1+DP(BACC, BC))
But DP(B, BC) and DP(BA, BC) are invalid since B and BA don't contain BC and, more specifically, don't end with a C (thus they can be assigned some arbitrarily large value).
Once we get to the last character in the search string, the value indicates that we have found the complete search string ending at that position in the string, so it should be compared to the global minimum.
Optimization:
To get an O(m*n) rather than O(m*n^2) running time, it's worth noting that you can stop iterating backwards as soon as you see another occurrence of the current letter (because any sequence up to that point is longer than the same sequence with only the last letter moved forward), i.e.:
Given a string ABCCD and a search string ABC, when checking the second C, we can stop as soon as we get to the first C (which is right away), since ABC is shorter than ABCC.
Side note:
I think one can do better than the DP approach, but if I were to suggest something else here, it would likely just be copied from / inspired by one of the answers to Find length of smallest window that contains all the characters of a string in another string.
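A direct, unoptimized (O(m*n^2)) Python sketch of this DP, with my own naming; it returns the (start, end) indices of the best window, using the protein example from the question:
def min_window_subsequence(A, B):
    INF = float("inf")
    n, m = len(A), len(B)
    if n == 0 or m == 0:
        return None
    # dp[j][i]: length of the shortest window ending at A[i] that contains
    # B[0..j] in order, with B[j] matched exactly at A[i]; INF if impossible
    dp = [[INF] * n for _ in range(m)]
    for i in range(n):
        if A[i] == B[0]:
            dp[0][i] = 1
    for j in range(1, m):
        for i in range(n):
            if A[i] != B[j]:
                continue
            for k in range(i):                        # look back for windows ending in B[j-1]
                if dp[j - 1][k] != INF:
                    dp[j][i] = min(dp[j][i], (i - k) + dp[j - 1][k])
    end = min(range(n), key=lambda i: dp[m - 1][i])
    if dp[m - 1][end] == INF:
        return None
    return end - dp[m - 1][end] + 1, end

print(min_window_subsequence("ABACDCABCDCCD", "BCD"))  # (7, 9)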

Make palindrome from given word

I am given a word like abca. I want to know how many letters I need to add to make it a palindrome.
In this case it's 1, because if I add b, I get abcba.
First, let's consider an inefficient recursive solution:
Suppose the string is of the form aSb, where a and b are letters and S is a substring.
If a==b, then f(aSb) = f(S).
If a!=b, then you need to add a letter: either add an a at the end, or add a b in the front. We need to try both and see which is better. So in this case, f(aSb) = 1 + min(f(aS), f(Sb)).
This can be implemented with a recursive function which will take exponential time to run.
To improve performance, note that this function will only be called with substrings of the original string. There are only O(n^2) such substrings. So by memoizing the results of this function, we reduce the time taken to O(n^2), at the cost of O(n^2) space.
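A memoized version of that recursion, using indices into the string instead of slicing; the function names are mine:
from functools import lru_cache

def insertions_to_palindrome(s):
    @lru_cache(maxsize=None)
    def f(i, j):                       # answer for the substring s[i..j]
        if i >= j:                     # empty or single character: already a palindrome
            return 0
        if s[i] == s[j]:               # outer letters match: solve the inside
            return f(i + 1, j - 1)
        # otherwise add one letter and recurse on both options
        return 1 + min(f(i + 1, j), f(i, j - 1))
    return f(0, len(s) - 1)

print(insertions_to_palindrome("abca"))  # 1 (for example "abcba")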
The basic algorithm would look like this:
Iterate over half the string and check whether the matching character exists at the corresponding position at the other end (i.e., if you have abca then the first character is an a and the string also ends with a).
If they match, then proceed to the next character.
If they don't match, then note that a character needs to be added.
Note that you can only move backwards from the end when the characters match. For example, if the string is abcdeffeda then the outer characters match. We then need to consider bcdeffed. The outer characters don't match, so a b needs to be added. But we don't want to continue with cdeffe (i.e., removing/ignoring both outer characters); we simply drop the b and continue looking at cdeffed. Similarly for c, and this means our algorithm returns 2 string modifications and not more.

Find the prefix substring which gives best compression

Problem:
Given a list of strings, find the substring which, if subtracted from the beginning of all strings where it matches and replaced by an escape byte, gives the shortest total length.
Example:
"foo", "fool", "bar"
The result is: "foo" as the base string with the strings "\0", "\0l", "bar" and a total length of 9 bytes. "\0" is the escape byte. The sum of the length of the original strings is 10, so in this case we only saved one byte.
A naive algorithm would look like:
for string in list
    for i = 1, i < length of string
        calculate total length based on prefix of string[0..i]
        if better than last best, save it
return the best prefix
That will give us the answer, but it's something like O((n*m)^2), which is too expensive.
Use a forest of prefix trees (tries)...
f_2      b_1
 |        |
o_2      a_1
 |        |
o_2      r_1
 |
l_1
Then we can find the best result, and guarantee it, by maximizing (depth * frequency) over the prefixes; the prefix that maximizes this is the one replaced with your escape character. You can optimize the search by doing a branch-and-bound depth-first search for the maximum.
On the complexity: O(C), as mentioned in the comments, for building it; for finding the optimum, it depends. If you order the first-level elements by frequency (O(A), where A is the size of the language's alphabet), then you'll be able to cut out more branches, and you have a good chance of getting sub-linear time.
I think this is clear, I am not going to write it up --what is this a homework assignment? ;)
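If you just want something concrete to compare against, here is a dict-based sketch (mine, not the branch-and-bound trie search above) that counts how many strings start with each candidate prefix and evaluates the resulting total length directly, with the escape byte costing one character and the chosen prefix stored once as the base string:
from collections import Counter

def best_prefix(strings):
    # count, for every prefix occurring in the input, how many strings start with it
    counts = Counter(s[:i] for s in strings for i in range(1, len(s) + 1))
    base_total = sum(len(s) for s in strings)
    best, best_total = None, base_total        # using no prefix at all is always allowed
    for prefix, freq in counts.items():
        # each match drops len(prefix) chars and gains one escape byte;
        # the prefix itself has to be stored once
        total = base_total - freq * (len(prefix) - 1) + len(prefix)
        if total < best_total:
            best, best_total = prefix, total
    return best, best_total

print(best_prefix(["foo", "fool", "bar"]))  # ('foo', 9)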
I would try starting by sorting the list. Then you simply go from string to string comparing the first character to the next string's first char. Once you have a match you would look at the next char. You would need to devise a way to track the best result so far.
Well, the first step would be to sort the list. Then make one pass through the list, comparing each element with the previous one, keeping track of the longest 2-character, 3-character, 4-character, etc. runs. Then figure out whether, say, 20 3-character prefixes are better than 15 4-character prefixes.
