The earliest shortest match location - algorithm

Assume you have a function matchWithMagic that returns a Boolean value for a given string. You don't know any detail how it do it but you know the result of true means "Match". Assuming the complexity of this function is linear in time and space to the size of input string.
Now the question is to implement a function that for a given string, that will return an pair of integers, namely pos and len such that matchWithMagic(substring(inputstring,pos,len)) matches and pos is the least number that this can be true (earliest match). When pos is known, len is the least number for a match (shortest match, with a lower priority). There are no requirement on efficiency but the answer should includes performance analyse.
For example, suppose the magic function return true for input strings contains "Good Guy!" or "Bad Guy!" your function should return pos=5,len=8 for "Good Bad Guy!".
Preferred languages is C/C++/Java/JavaScript/C#/Basic, but other languages are OK.
UPDATE
A trivial answer has now been posted. I hope a more efficient solution could appear.

Given that the function is black-box, I'm not sure you can do better than this:
for (int pos = 0:n)
for (int len = 0:(n-pos))
if (matchWithMagic(substring(inputstring,pos,len)))
return {pos, len};
return null;
Which I believe is O(n^3) (assuming matchWithMagic is O(n)).

Related

How to instruct longest palindrome from a list of numbers?

I am trying to solve a question which says that we need to write a function in which given a list of numbers, we need to find the longest palindrome that we can from given only the numbers in the list.
For eg:
If the given list is : [3,47,6,6,5,6,15,22,1,6,15]
The longest palindrome that we can return is one of length 9, such as [6,15,6,3,47,3,6,15,6].
Additionally, we have the following constraints:
One can only use an array queue, array stack, and a chaining hashmap, and the list we are supposed to return, and the function must run in linear time. And we can use only constant additional space.
My approach was the following:
Since a palindrome can be formed if have an even number of certain characters, we can iterate over all the elements in the list, and store in a chaining hash map, the number of times each number appears in the list. This should take O(N) time since each lookup in the chaining hash map takes constant time, and iterating over the list takes linear time.
Then we can iterate over all the numbers in the chaining hash map, to see which numbers appear an even number of times, and accordingly, just make a palindrome. In the worst case, this will take a O(n) linear time.
Now there are two things I am wondering:
How should I make the actual palindrome? Like how do I use the data structures that I am being allowed to use in order to make a palindrome? I am thinking that since the queue is a LIFO data structure, for each number that occurs an even number of times, we add it once to the queue and once to the stack, and so on and so forth. And finally, we can just dequeue everything from the queue, and pop once from the stack, and then add it to the list!
It seems that with my approach, it is taking me two linear runs to solve the question. I am wondering if there is a faster way to do this.
Any and all help will be appreciated. Thanks!
It is not possible to get a better algorithm than one that is O(n), as every number in the input has to be inspected, as it might provide a possibility for a longer palindrome. If indeed the output must be a longest palindrome itself (and not only its length), then producing that output itself represents O(n).
You have also omitted one additional thing you have to do in your algorithm: there can be one value in the final palindrome that occurs only once (in the centre). So whenever you encounter a value that occurs an odd number of times, you may reserve one occurrence of that value for putting in the middle of an odd-length palindrome. The even remainder of the occurrences can be used as usual.
As to your questions:
How should I make the actual palindrome?
There are many ways to do it. But don't forget that if you have an even number of occurrences, you should use all those occurrences, not just two. So add half of them to the queue and half of them to the stack. When the frequency is odd, then still do the same (rounded down), and log the number also as a potential centre value.
When you have done this for all values, then dump the queue and stack together in the result list as you suggested, but don't forget to put the centre value in between the two, if you identified such a centre value (i.e. when not all occurrences were even).
It seems that with my approach, it is taking me two linear runs to solve the question.
You cannot do this better than with a linear time complexity. You can save a bit of time if you use the stack also for the result, and just dump the queue unto the stack (after potentially pushing the centre value).
I've got a solution when its palindrome only for the number and not the digit.
for the input: [51,15]
we will return [15] || [51] and not [51,15] =>(5,1,1,5);
feature more your example as a problem 3 doesn't appear twice(and appears in the answer)
or maybe I didn't understand the question.
public static int[] polidrom(int [] numbers){
HashMap<Integer/*the numbere*/,Integer/*how many time appeared*/> hash = new HashMap<>();
boolean middleFree= false;
int middleNumber = 0;
int space = 0;
Stack<Integer> stack = new Stack<>();
for (Integer num:numbers) {//count how mant times each digit appears
if(hash.containsKey(num)){hash.replace(num,1+hash.get(num));}
else{hash.put(num,1);}
}
for (Integer num:hash.keySet()) {//how many times i can use pairs
int original =hash.get(num);
int insert = (int)Math.floor(hash.get(num)/2);
if(hash.get(num) % 2 !=0){middleNumber = num;middleFree = true;}//as no copy
hash.replace(num,insert);
if(insert != 0){
for(int i =0; i < original;i++){
stack.push(num);
}
}
}
space = stack.size();
if(space == numbers.length){ space--;};//all the numbers are been used
int [] answer = new int[space+1];//the middle dont have to have an pair
int startPointer =0 , endPointer= space;
while (!stack.isEmpty()){
int num = stack.pop();
answer[startPointer] = num;
answer[endPointer] = num;
startPointer++;
endPointer--;
}
if (middleFree){answer[answer.length/2] = middleNumber;}
return answer;
}
space O(n) => {stack , hashMap , answer Array};
complexity: O(n)
You can skip the part where I used the stack and build the answer array in the same loop.
and I can't think of a way where you will not iterate at least twice;
Hope I've helped

Splitting a sentence to minimize sentence lengths

I have come across the following problem statement:
You have a sentence written entirely in a single row. You would like to split it into several rows by replacing some of the spaces
with "new row" indicators. Your goal is to minimize the width of the
longest row in the resulting text ("new row" indicators do not count
towards the width of a row). You may replace at most K spaces.
You will be given a sentence and a K. Split the sentence using the
procedure described above and return the width of the longest row.
I am a little lost with where to start. To me, it seems I need to try to figure out every possible sentence length that satisfies the criteria of splitting the single sentence up into K lines.
I can see a couple of edge cases:
There are <= K words in the sentence, therefore return the longest word.
The sentence length is 0, return 0
If neither of those criteria are true, then we have to determine all possible combinations of splitting the sentence and the return the minimum of all those options. This is the part I don't know how to do (and is obviously the heart of the problem).
You can solve it by inverting the problem. Let's say I fix the length of the longest split to L. Can you compute the minimum number of breaks you need to satisfy it?
Yes, you just break before the first word that would go over L and count them up (O(N)).
So now that we have that we just have to find a minimum L that would require less or equal K breaks. You can do a binary search in the length of the input. Final complexity O(NlogN).
First Answer
What you want to achieve is Minimum Raggedness. If you just want the algorithm, it is here as a PDF. If the research paper's link goes bad, please search for the famous paper named Breaking Paragraphs into Lines by Knuth.
However if you want to get your hands over some implementations of the same, in the question Balanced word wrap (Minimum raggedness) in PHP on SO, people have actually given implementation not only in PHP but in C, C++ and bash as well.
Second Answer
Though this is not exactly a correct approach, it is quick and dirty if you are looking for something like that. This method will not return correct answer for every case. It is for those people for whom time to ship their product is more important.
Idea
You already know the length of your input string. Let's call it L;
When putting in K breaks, the best scenario would be to be able to break the string to parts of exactly L / (K + 1) size;
So break your string at that word which makes the resulting sentence part's length least far from L / (K + 1);
My recursive solution, which can be improved through memoization or dynamic programming.
def split(self,sentence, K):
if not sentence: return 0
if ' ' not in sentence or K == 0: return len(sentence)
spaces = [i for i, s in enumerate(sentence) if s == ' ']
res = 100000
for space in spaces:
res = min(res, max(space, self.split(sentence[space+1:], K-1)))
return res

How to determine whether a string is composed only of letters given by a second string

I have two strings. How can I determine whether the first string is composed only of letters given by the second string?
For example:
A = abcd
B = abbcc
should return false, since d is not in the second string.
A = aab
B = ab
should return true.
If the program most of the time returns false, how can I optimize this program? If it returns true most of the time, how can I optimize it then?
Sort both strings. Then go through A, and have a pointer going through B. If the character in A is the same as what the B pointer points to, keep looking through A. If the character in A is later in the alphabet than what the B pointer points to, advance the B pointer. If the character in A is earlier in the alphabet than what the B pointer points to, return false. If you run out of A, return true.
[How do I] determine [if] the first string [is composed of characters that appear in] the second string?
Here's a quick algorithm:
Treat the first and second strings as two sets of characters S and T.
Perform the set difference S - T. Call the result U.
If U is nonempty, return false. Otherwise, return true.
Here is one simple way.
def isComposedOf(A, B):
bset = set(B)
for c in A:
if c not in bset:
return False
return True
This algorithm walks each string once, so it runs in O(len(A) + len(B)) time.
When the answer is yes, you cannot do better than len(A) comparisons even in the best case, because no matter what you must check every letter. And in the worst case one of the characters in A is hidden very deep in B. So O(len(A) + len(B)) is optimal, as far as worst-case performance is concerned.
Similarly: when the answer is no, you cannot do better than len(B) comparisons even in the best case; and in the worst case the character that isn't in B is hidden very deep in A. So O(len(A) + len(B)) is again optimal.
You can reduce the constant factor by using a better data structure for bset.
You can avoid scanning all of B in some (non-worst) cases where the answer is yes by building it lazily, scanning more of B each time you find a character in A that you haven't seen before.
> If the program is always return false, how to optimize this program?
return false
> If it is always return true, how to optimize it?
return true
EDIT: Seriously, it's a good question what algorithm optimizes for the case of failure, and what algorithm optimizes for the case of success. I don't know what algorithm strstr uses, it may be a generally-good algorithm which isn't optimal for either of those assumptions.
Maybe you'll catch someone here who knows offhand. If not, this looks like a good place to start reading:Exact String Matching Algorithms
assuming strings has all lower case in it, then you can have a bit vector and set the bit based on position position = str1[i] - 'a'. to set it you would do
bitVector |= (1<<pos). And then for str2 you would check if a bit is set in bitVector for all bits, if so return true otherwise return false.

What is a typical algorithm for finding a string within a string?

I recently had an interview question that went something like this:
Given a large string (haystack), find a substring (needle)?
I was a little stumped to come up with a decent solution.
What is the best way to approach this that doesn't have a poor time complexity?
I like the Boyer-Moore algorithm. It's especially fun to implement when you have a lot of needles to find in a haystack (for example, probable-spam patterns in an email corpus).
You can use the Knuth-Morris-Pratt algorithm, which is O(n+m) where n is the length of the "haystack" string and m is the length of the search string.
The general problem is string searching; there are many algorithms and approaches depending on the nature of the application.
Knuth-Morris-Pratt, searches for a string within another string
Boyer-Moore, another string within string search
Aho-Corasick; searches for words from a reference dictionary in a given arbitrary text
Some advanced index data structures are also used in other applications. Suffix trees are used a lot in bioinformatics; here you have one long reference text, and then you have many arbitrary strings that you want to find all occurrences for. Once the index (i.e. the tree) is built, finding patterns can be done quite efficiently.
For an interview answer, I think it's better to show breadth as well. Knowing about all these different algorithms and what specific purposes they serve best is probably better than knowing just one algorithm by heart.
I do not believe that there is anything better then looking at the haystack one character at a time and try to match them against the needle.
This will have linear time complexity (against the length of the haystack). It could degrade if the needle is near the same length as the haystack and shares a long common and repeated prefix.
A typical algorithm would be as follows, with string indexes ranging from 0 to M-1.
It returns either the position of the match or -1 if not found.
foreach hpos in range 0 thru len(haystack)-len(needle):
found = true
foreach npos in range 0 thru len(needle):
if haystack[hpos+npos] != needle[npos]:
found = false
break
if found:
return hpos
return -1
It has a reasonable performance since it only checks as many characters in each haystack position as needed to discover there's no chance of a match.
It's not necessarily the most efficient algorithm but if your interviewer expects you to know all the high performance algorithms off the top of your head, they're being unrealistic (i.e., fools). A good developer knows the basics well, and how to find out the advanced stuff where necessary (e.g., when there's a performance problem, not before).
The performance ranges between O(a) and O(ab) depending on the actual data in the strings, where a and b are the haystack and needle lengths respectively.
One possible improvement is to store, within the npos loop, the first location greater than hpos, where the character matches the first character of the needle.
That way you can skip hpos forward in the next iteration since you know there can be no possible match before that point. But again, that may not be necessary, depending on your performance requirements. You should work out the balance between speed and maintainability yourself.
This problem is discussed in Hacking a Google Interview Practice Questions – Person A. Their sample solution:
bool hasSubstring(const char *str, const char *find) {
if (str[0] == '\0' && find[0] == '\0')
return true;
for(int i = 0; str[i] != '\0'; i++) {
bool foundNonMatch = false;
for(int j = 0; find[j] != '\0'; j++) {
if (str[i + j] != find[j]) {
foundNonMatch = true;
break;
}
}
if (!foundNonMatch)
return true;
}
return false;
}
Here is a presentation of a few algorithm (shamelessly linked from Humboldt University). contains a few good algorithms such as Boyer More and Z-box.
I did use the Z-Box algorithm, found it to work well and was more efficient than Boyer-More, however needs some time to wrap your head around it.
the easiest way I think is "haystack".Contains("needle") ;)
just fun,dont take it seriously.I think already you have got your answer.

Palindrome detection efficiency

I got curious by Jon Limjap's interview mishap and started to look for efficient ways to do palindrome detection. I checked the palindrome golf answers and it seems to me that in the answers are two algorithms only, reversing the string and checking from tail and head.
def palindrome_short(s):
length = len(s)
for i in xrange(0,length/2):
if s[i] != s[(length-1)-i]: return False
return True
def palindrome_reverse(s):
return s == s[::-1]
I think neither of these methods are used in the detection of exact palindromes in huge DNA sequences. I looked around a bit and didn't find any free article about what an ultra efficient way for this might be.
A good way might be parallelizing the first version in a divide-and-conquer approach, assigning a pair of char arrays 1..n and length-1-n..length-1 to each thread or processor.
What would be a better way?
Do you know any?
Given only one palindrome, you will have to do it in O(N), yes. You can get more efficiency with multi-processors by splitting the string as you said.
Now say you want to do exact DNA matching. These strings are thousands of characters long, and they are very repetitive. This gives us the opportunity to optimize.
Say you split a 1000-char long string into 5 pairs of 100,100. The code will look like this:
isPal(w[0:100],w[-100:]) and isPal(w[101:200], w[-200:-100]) ...
etc... The first time you do these matches, you will have to process them. However, you can add all results you've done into a hashtable mapping pairs to booleans:
isPal = {("ATTAGC", "CGATTA"): True, ("ATTGCA", "CAGTAA"): False}
etc... this will take way too much memory, though. For pairs of 100,100, the hash map will have 2*4^100 elements. Say that you only store two 32-bit hashes of strings as the key, you will need something like 10^55 megabytes, which is ridiculous.
Maybe if you use smaller strings, the problem can be tractable. Then you'll have a huge hashmap, but at least palindrome for let's say 10x10 pairs will take O(1), so checking if a 1000 string is a palindrome will take 100 lookups instead of 500 compares. It's still O(N), though...
Another variant of your second function. We need no check equals of the right parts of normal and reverse strings.
def palindrome_reverse(s):
l = len(s) / 2
return s[:l] == s[l::-1]
Obviously, you're not going to be able to get better than O(n) asymptotic efficiency, since each character must be examined at least once. You can get better multiplicative constants, though.
For a single thread, you can get a speedup using assembly. You can also do better by examining data in chunks larger than a byte at a time, but this may be tricky due to alignment considerations. You'll do even better to use SIMD, if you can examine chunks as large as 16 bytes at a time.
If you wanted to parallelize it, you could divide the string into N pieces, and have processor i compare the segment [i*n/2, (i+1)*N/2) with the segment [L-(i+1)*N/2, L-i*N/2).
There isn't, unless you do a fuzzy match. Which is what they probably do in DNA (I've done EST searching in DNA with smith-waterman, but that is obviously much harder then matching for a palindrome or reverse-complement in a sequence).
They are both in O(N) so I don't think there is any particular efficiency problem with any of these solutions. Maybe I am not creative enough but I can't see how would it be possible to compare N elements in less than N steps, so something like O(log N) is definitely not possible IMHO.
Pararellism might help, but it still wouldn't change the big-Oh rank of the algorithm since it is equivalent to running it on a faster machine.
Comparing from the center is always much more efficient since you can bail out early on a miss but it alwo allows you to do faster max palindrome search, regardless of whether you are looking for the maximal radius or all non-overlapping palindromes.
The only real paralellization is if you have multiple independent strings to process. Splitting into chunks will waste a lot of work for every miss and there's always much more misses than hits.
On top of what others said, I'd also add a few pre-check criteria for really large inputs :
quick check whether tail-character matches
head character
if NOT, just early exit by returning Boolean-False
if (input-length < 4) {
# The quick check just now already confirmed it's palindrome
return Boolean-True
} else if (200 < input-length) {
# adjust this parameter to your preferences
#
# e.g. make it 20 for longer than 8000 etc
# or make it scale to input size,
# either logarithmically, or a fixed ratio like 2.5%
#
reverse last ( N = 4 ) characters/bytes of the input
if that **DOES NOT** match first N chars/bytes {
return boolean-false # early exit
# no point to reverse rest of it
# when head and tail don't even match
} else {
if N was substantial
trim out the head and tail of the input
you've confirmed; avoid duplicated work
remember to also update the variable(s)
you've elected to track the input size
}
[optional 1 : if that substring of N characters you've
just checked happened to all contain the
same character, let's call it C,
then locate the index position, P, for the first
character that isn't C
if P == input-size
then you've already proven
the entire string is a nonstop repeat
of one single character, which, by def,
must be a palindrome
then just return Boolean-True
but the P is more than the half-way point,
you've also proven it cannot possibly be a
palindrome, because second half contains a
component that doesn't exist in first half,
then just return Boolean-False ]
[optional 2 : for extremely long inputs,
like over 200,000 chars,
take the N chars right at the middle of it,
and see if the reversed one matches
if that fails, then do early exit and save time ]
}
if all pre-checks passed,
then simply do it BAU style :
reverse second-half of it,
and see if it's same as first half
With Python, short code can be faster since it puts the load into the faster internals of the VM (And there is the whole cache and other such things)
def ispalin(x):
return all(x[a]==x[-a-1] for a in xrange(len(x)>>1))
You can use a hashtable to put the character and have a counter variable whose value increases everytime you find an element not in table/map. If u check and find element thats already in table decrease the count.
For odd lettered string the counter should be back to 1 and for even it should hit 0.I hope this approach is right.
See below the snippet.
s->refers to string
eg: String s="abbcaddc";
Hashtable<Character,Integer> textMap= new Hashtable<Character,Integer>();
char charA[]= s.toCharArray();
for(int i=0;i<charA.length;i++)
{
if(!textMap.containsKey(charA[i]))
{
textMap.put(charA[i], ++count);
}
else
{
textMap.put(charA[i],--count);
}
if(length%2 !=0)
{
if(count == 1)
System.out.println("(odd case:PALINDROME)");
else
System.out.println("(odd case:not palindrome)");
}
else if(length%2==0)
{
if(count ==0)
System.out.println("(even case:palindrome)");
else
System.out.println("(even case :not palindrome)");
}
public class Palindrome{
private static boolean isPalindrome(String s){
if(s == null)
return false; //unitialized String ? return false
if(s.isEmpty()) //Empty Strings is a Palindrome
return true;
//we want check characters on opposite sides of the string
//and stop in the middle <divide and conquer>
int left = 0; //left-most char
int right = s.length() - 1; //right-most char
while(left < right){ //this elegantly handles odd characters string
if(s.charAt(left) != s.charAt(right)) //left char must equal
return false; //right else its not a palindrome
left++; //converge left to right
right--;//converge right to left
}
return true; // return true if the while doesn't exit
}
}
though we are doing n/2 calculations its still O(n)
this can done also using threads, but calculations get messy, best to avoid it. this doesn't test for special characters and is case sensitive. I have code that does it, but this code can be modified to handle that easily.

Resources