Palindrome detection efficiency - algorithm

Jon Limjap's interview mishap got me curious, and I started looking for efficient ways to do palindrome detection. I checked the palindrome golf answers, and it seems to me there are only two algorithms in the answers: reversing the string, and checking from tail and head.
def palindrome_short(s):
    length = len(s)
    for i in xrange(0, length / 2):
        if s[i] != s[(length - 1) - i]:
            return False
    return True

def palindrome_reverse(s):
    return s == s[::-1]
I don't think either of these methods is used for detecting exact palindromes in huge DNA sequences. I looked around a bit and didn't find any freely available article about what an ultra-efficient way of doing this might be.
A good way might be to parallelize the first version in a divide-and-conquer approach, assigning a pair of character ranges, 1..n from the front and length-1-n..length-1 from the back, to each thread or processor.
What would be a better way?
Do you know any?

For a single palindrome check, you will have to do it in O(N), yes. You can get more efficiency with multiple processors by splitting the string as you said.
Now say you want to do exact DNA matching. These strings are thousands of characters long, and they are very repetitive. This gives us the opportunity to optimize.
Say you split a 1000-char long string into 5 pairs of 100,100. The code will look like this:
isPal(w[0:100], w[-100:]) and isPal(w[100:200], w[-200:-100]) ...
etc... The first time you do these matches, you will have to process them. However, you can add all results you've done into a hashtable mapping pairs to booleans:
isPal = {("ATTAGC", "CGATTA"): True, ("ATTGCA", "CAGTAA"): False}
etc. This will take way too much memory, though. For pairs of 100,100, the hash map will have 2*4^100 elements. Even if you only store two 32-bit hashes of the strings as the key, you will need something like 10^55 megabytes, which is ridiculous.
Maybe if you use smaller strings, the problem can be tractable. Then you'll have a huge hashmap, but at least a palindrome check for, let's say, 10,10 pairs will take O(1), so checking whether a 1000-character string is a palindrome will take 100 lookups instead of 500 compares. It's still O(N), though...
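As a rough Python sketch of that chunk-and-memoize idea (the chunk size and function names are mine, and the cache only pays off if identical chunk pairs really do recur):

from functools import lru_cache

@lru_cache(maxsize=None)
def is_mirror_pair(left, right):
    # True if `right` is the reverse of `left`; results are memoized per pair
    return left == right[::-1]

def palindrome_chunked(s, chunk=100):
    # compare the first half against the mirrored second half, chunk by chunk
    half = len(s) // 2
    for start in range(0, half, chunk):
        end = min(start + chunk, half)
        if not is_mirror_pair(s[start:end], s[len(s) - end:len(s) - start]):
            return False
    return True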

Another variant of your second function. There is no need to compare the right half of the string against the right half of its reverse.

def palindrome_reverse(s):
    l = len(s) // 2
    # compare the first half with the reversed second half
    return s[:l] == s[len(s) - l:][::-1]

Obviously, you're not going to be able to get better than O(n) asymptotic efficiency, since each character must be examined at least once. You can get better multiplicative constants, though.
For a single thread, you can get a speedup using assembly. You can also do better by examining data in chunks larger than a byte at a time, but this may be tricky due to alignment considerations. You'll do even better to use SIMD, if you can examine chunks as large as 16 bytes at a time.
If you wanted to parallelize it, you could split the first half of the string into N pieces and have processor i compare the segment [i*L/(2N), (i+1)*L/(2N)) with the mirrored segment [L-(i+1)*L/(2N), L-i*L/(2N)), where L is the string length.
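A hedged sketch of that split in Python, using multiprocessing (chunking details and names are mine; each task pickles the whole string, so a serious version would use shared memory, and on spawn-based platforms you would also need the usual if __name__ == '__main__' guard):

from multiprocessing import Pool

def _check_chunk(args):
    # compare one chunk from the front with its mirrored chunk from the back
    s, start, end = args
    return s[start:end] == s[len(s) - end:len(s) - start][::-1]

def palindrome_parallel(s, workers=4):
    half = len(s) // 2
    chunk = max(1, half // workers)
    tasks = [(s, i, min(i + chunk, half)) for i in range(0, half, chunk)]
    with Pool(workers) as pool:
        return all(pool.map(_check_chunk, tasks))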

There isn't, unless you do a fuzzy match, which is what they probably do in DNA (I've done EST searching in DNA with Smith-Waterman, but that is obviously much harder than matching for a palindrome or reverse complement in a sequence).

They are both O(N), so I don't think there is any particular efficiency problem with either of these solutions. Maybe I am not creative enough, but I can't see how it would be possible to compare N elements in fewer than N steps, so something like O(log N) is definitely not possible IMHO.
Parallelism might help, but it still wouldn't change the big-O rank of the algorithm, since it is equivalent to running it on a faster machine.

Comparing from the center is always much more efficient, since you can bail out early on a miss, and it also allows you to do a faster maximum-palindrome search, regardless of whether you are looking for the maximal radius or for all non-overlapping palindromes.
The only real parallelization is when you have multiple independent strings to process. Splitting into chunks wastes a lot of work for every miss, and there are always many more misses than hits.
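For illustration, a small Python sketch of the center-out expansion for maximal-radius search (my own code, odd-length centers only for brevity):

def max_palindrome_radius(s, center):
    # expand outwards from `center` while the mirrored characters match
    radius = 0
    while (center - radius - 1 >= 0 and center + radius + 1 < len(s)
           and s[center - radius - 1] == s[center + radius + 1]):
        radius += 1
    return radius

Checking whether the whole string is a palindrome is the special case where the expansion from the middle reaches both ends.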

On top of what others said, I'd also add a few pre-check criteria for really large inputs, sketched in pseudocode here (a rough Python version follows after the pseudocode):
quick check whether the tail character matches the head character;
if NOT, just early-exit by returning Boolean-False
if (input-length < 4) {
# The quick check just now already confirmed it's palindrome
return Boolean-True
} else if (200 < input-length) {
# adjust this parameter to your preferences
#
# e.g. make it 20 for longer than 8000 etc
# or make it scale to input size,
# either logarithmically, or a fixed ratio like 2.5%
#
reverse last ( N = 4 ) characters/bytes of the input
if that **DOES NOT** match first N chars/bytes {
return boolean-false # early exit
# no point to reverse rest of it
# when head and tail don't even match
} else {
if N was substantial,
trim off the head and tail of the input
that you've just confirmed, to avoid duplicated work;
remember to also update whatever variable(s)
you've elected to track the input size with
}
[optional 1 : if that substring of N characters you've
just checked happened to all contain the
same character, let's call it C,
then locate the index position, P, for the first
character that isn't C
if P == input-size
then you've already proven
the entire string is a nonstop repeat
of one single character, which, by def,
must be a palindrome
then just return Boolean-True
but if P is more than the half-way point,
you've also proven it cannot possibly be a
palindrome, because second half contains a
component that doesn't exist in first half,
then just return Boolean-False ]
[optional 2 : for extremely long inputs,
like over 200,000 chars,
take the N chars right at the middle of it,
and see if the reversed one matches
if that fails, then do early exit and save time ]
}
if all pre-checks passed,
then simply do it BAU style:
reverse the second half of it,
and see if it's the same as the first half
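A rough Python rendering of those pre-checks (the thresholds 4 and 200 are taken from the pseudocode above; the function name and the remaining details are my own sketch):

def is_palindrome_with_prechecks(s, n=4, long_threshold=200):
    if not s:
        return True
    # quick check: head character vs tail character
    if s[0] != s[-1]:
        return False
    if len(s) < 4:
        # head == tail already confirmed; 1-3 chars with matching ends is a palindrome
        return True
    if len(s) > long_threshold:
        # compare the first n chars with the reversed last n chars before the full check
        if s[:n] != s[-n:][::-1]:
            return False
        # optional: if those n chars are all the same character C, see how far the run of C goes
        c = s[0]
        if s[:n] == c * n:
            p = 0
            while p < len(s) and s[p] == c:
                p += 1
            if p == len(s):
                return True      # the whole string is one repeated character
            if p > len(s) // 2:
                return False     # first half is all C, but the second half is not
        s = s[n:-n]              # trim the confirmed head and tail
    # business-as-usual: compare the first half with the reversed second half
    half = len(s) // 2
    return s[:half] == s[len(s) - half:][::-1]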

With Python, short code can be faster, since it pushes the work into the faster internals of the VM (and then there are caching effects and other such things):

def ispalin(x):
    return all(x[a] == x[-a - 1] for a in xrange(len(x) >> 1))

You can use a hashtable to store each character, and keep a counter variable whose value increases every time you find an element that is not in the table/map. If you check and find an element that is already in the table, decrease the count.
For an odd-lettered string the counter should end back at 1, and for an even-lettered one it should hit 0. I hope this approach is right.
See the snippet below.
s refers to the string,
e.g. String s = "abbcaddc";
Hashtable<Character, Integer> textMap = new Hashtable<Character, Integer>();
char charA[] = s.toCharArray();
int length = charA.length;
int count = 0; // how many characters have currently been seen an odd number of times
for (int i = 0; i < charA.length; i++)
{
    if (!textMap.containsKey(charA[i]))
    {
        textMap.put(charA[i], 1);
        ++count;
    }
    else
    {
        textMap.remove(charA[i]); // seen an even number of times again
        --count;
    }
}
if (length % 2 != 0)
{
    if (count == 1)
        System.out.println("(odd case:PALINDROME)");
    else
        System.out.println("(odd case:not palindrome)");
}
else if (length % 2 == 0)
{
    if (count == 0)
        System.out.println("(even case:palindrome)");
    else
        System.out.println("(even case :not palindrome)");
}

public class Palindrome{
    private static boolean isPalindrome(String s){
        if(s == null)
            return false; //uninitialized String? return false
        if(s.isEmpty()) //an empty String is a palindrome
            return true;
        //we want to check characters on opposite sides of the string
        //and stop in the middle <divide and conquer>
        int left = 0;               //left-most char
        int right = s.length() - 1; //right-most char
        while(left < right){ //this elegantly handles odd-length strings
            if(s.charAt(left) != s.charAt(right)) //left char must equal
                return false;                     //right, else it's not a palindrome
            left++;  //converge left to right
            right--; //converge right to left
        }
        return true; //return true if the while loop never exited early
    }
}
Though we are doing n/2 comparisons, it's still O(n).
This can also be done using threads, but the calculations get messy; best to avoid it. This doesn't test for special characters and is case-sensitive. I have code that handles that, but this code can be modified to do so easily.

Related

How to construct the longest palindrome from a list of numbers?

I am trying to solve a question which says that we need to write a function that, given a list of numbers, finds the longest palindrome that we can form using only the numbers in the list.
For eg:
If the given list is : [3,47,6,6,5,6,15,22,1,6,15]
The longest palindrome that we can return is one of length 9, such as [6,15,6,3,47,3,6,15,6].
Additionally, we have the following constraints:
We can only use an array queue, an array stack, a chaining hashmap, and the list we are supposed to return; the function must run in linear time, and we can use only constant additional space.
My approach was the following:
Since a palindrome can be formed when we have an even number of certain elements, we can iterate over all the elements in the list and store in a chaining hash map the number of times each number appears in the list. This should take O(N) time, since each lookup in the chaining hash map takes constant time and iterating over the list takes linear time.
Then we can iterate over all the numbers in the chaining hash map to see which numbers appear an even number of times, and accordingly just build a palindrome. In the worst case, this will take O(N) linear time.
Now there are two things I am wondering:
How should I make the actual palindrome? Like, how do I use the data structures that I am allowed to use in order to make a palindrome? I am thinking that since the queue is a FIFO data structure and the stack is LIFO, for each number that occurs an even number of times we add it once to the queue and once to the stack, and so on and so forth. And finally, we can just dequeue everything from the queue, pop everything from the stack, and add it all to the list!
It seems that with my approach, it is taking me two linear runs to solve the question. I am wondering if there is a faster way to do this.
Any and all help will be appreciated. Thanks!
It is not possible to get a better algorithm than one that is O(n), as every number in the input has to be inspected, as it might provide a possibility for a longer palindrome. If indeed the output must be a longest palindrome itself (and not only its length), then producing that output itself represents O(n).
You have also omitted one additional thing you have to do in your algorithm: there can be one value in the final palindrome that occurs only once (in the centre). So whenever you encounter a value that occurs an odd number of times, you may reserve one occurrence of that value for putting in the middle of an odd-length palindrome. The even remainder of the occurrences can be used as usual.
As to your questions:
How should I make the actual palindrome?
There are many ways to do it. But don't forget that if you have an even number of occurrences, you should use all those occurrences, not just two. So add half of them to the queue and half of them to the stack. When the frequency is odd, then still do the same (rounded down), and log the number also as a potential centre value.
When you have done this for all values, then dump the queue and stack together in the result list as you suggested, but don't forget to put the centre value in between the two, if you identified such a centre value (i.e. when not all occurrences were even).
It seems that with my approach, it is taking me two linear runs to solve the question.
You cannot do this better than with a linear time complexity. You can save a bit of time if you use the stack also for the result, and just dump the queue onto the stack (after potentially pushing the centre value).
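To make that concrete, here is a small Python sketch of the count-then-queue/stack construction (deque and a plain list stand in for the allowed array queue and array stack; names are mine):

from collections import Counter, deque

def longest_palindrome(numbers):
    counts = Counter(numbers)      # chaining hashmap: number -> frequency
    queue, stack = deque(), []     # left half and (reversed) right half
    centre = None
    for num, cnt in counts.items():
        if cnt % 2 == 1:
            centre = num           # one spare copy may sit in the middle
        for _ in range(cnt // 2):  # use every full pair, not just one
            queue.append(num)
            stack.append(num)
    result = list(queue)           # dequeue everything: left half
    if centre is not None:
        result.append(centre)      # optional centre element
    while stack:
        result.append(stack.pop()) # pop everything: mirrored right half
    return result
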
I've got a solution that treats the palindrome at the level of whole numbers, not digits.
For the input [51, 15],
we will return [15] or [51], and not [51, 15] => (5,1,1,5).
Furthermore, your example has a problem: 3 doesn't appear twice in the list (yet it appears twice in the answer),
or maybe I didn't understand the question.
public static int[] polidrom(int[] numbers){
    HashMap<Integer/*the number*/, Integer/*how many times it appeared*/> hash = new HashMap<>();
    boolean middleFree = false;
    int middleNumber = 0;
    Stack<Integer> stack = new Stack<>();
    for (Integer num : numbers) { //count how many times each number appears
        if(hash.containsKey(num)){ hash.replace(num, 1 + hash.get(num)); }
        else{ hash.put(num, 1); }
    }
    for (Integer num : hash.keySet()) { //how many pairs of each number I can use
        int pairs = hash.get(num) / 2;
        if(hash.get(num) % 2 != 0){ middleNumber = num; middleFree = true; } //a spare copy for the middle
        for(int i = 0; i < pairs; i++){
            stack.push(num); //one stack entry per usable pair
        }
    }
    int space = 2 * stack.size(); //each pair fills one slot on each side
    if (middleFree) { space++; }  //the middle doesn't have to have a pair
    int[] answer = new int[space];
    int startPointer = 0, endPointer = space - 1;
    while (!stack.isEmpty()){
        int num = stack.pop();
        answer[startPointer] = num;
        answer[endPointer] = num;
        startPointer++;
        endPointer--;
    }
    if (middleFree){ answer[answer.length / 2] = middleNumber; }
    return answer;
}
Space: O(n) => {stack, hashMap, answer array}.
Complexity: O(n).
You can skip the part where I used the stack and build the answer array in the same loop,
and I can't think of a way in which you would not iterate at least twice.
Hope I've helped.

If a word is made up of two valid words

Given a dictionary, find out if a given word can be made from two words in the dictionary. For example, given "newspaper" you have to find whether it can be made from two words ("news" and "paper" in this case). The only thing I can think of is starting from the beginning and checking whether the current prefix is a word: checking n, ne, new, news, ... and, whenever the prefix is a valid word, checking whether the remaining part is also a valid word.
Also, how do you generalize it for k (meaning the word is made up of k words)? Any thoughts?
Starting your split at the center may yield results faster. For example, for newspaper, you would first try splitting at 'news paper' or 'newsp aper'. As you can see, for this example, you would find your result on the first or second try. If you do not find a result, just search outwards. See the example for 'crossbow' below:
cros sbow
cro ssbow
cross bow
For the case with two words, the problem can be solved by just considering all possible ways of splitting the word into two, then checking each half to see if it's a valid word. If the input string has length n, then there are only O(n) different ways of splitting the string. If you store the dictionary strings in a structure supporting fast lookup (say, a trie or a hash table), each of those checks is fast, so the whole thing runs efficiently.
The more interesting case is when you have k > 2 words to split the word into. For this, we can use a really elegant recursive formulation:
A word can be split into k words if it can be split into a word followed by a word splittable into k - 1 words.
The recursive base case would be that a word can be split into zero words only if it's the empty string, which is trivially true.
To use this recursive insight, we'll modify the original algorithm by considering all possible splits of the word into two parts. Once we have that split, we can check if the first part of the split is a word and if the second part of the split can be broken apart into k - 1 words. As an optimization, we don't recurse on all possible splits, but rather just on those where we know the first word is valid. Here's some sample code written in Java:
public static boolean isSplittable(String word, int k, Set<String> dictionary) {
    /* Base case: the empty string can only be split into zero words, and
     * vice-versa.
     */
    if (word.isEmpty() || k == 0)
        return word.isEmpty() && k == 0;

    /* Generate all possible non-empty splits of the word into two parts, recursing on
     * problems where the first word is known to be valid.
     *
     * This loop is structured so that we always try pulling off at least one letter
     * from the input string so that we don't try splitting the word into k pieces
     * of which some are empty.
     */
    for (int i = 1; i <= word.length(); ++i) {
        String first = word.substring(0, i), last = word.substring(i);
        if (dictionary.contains(first) &&
            isSplittable(last, k - 1, dictionary))
            return true;
    }

    /* If we're here, then no possible split works in this case and we should signal
     * that no solution exists.
     */
    return false;
}
This code, in the worst case, runs in time O(n^k), because it tries to generate all possible partitions of the string into k different pieces. Of course, it's unlikely to hit this worst-case behavior, because most possible splits won't end up forming any words.
I'd first loop through the dictionary using a strpos(-like) function to check whether each word occurs in the input at all, and then see if you can find a match among the results (a rough sketch follows below).
So it would do something like this:
Loop through the dictionary, strpos-ing every word in the dictionary and saving the hits into an array; let's say it gives me the results 'new', 'paper', and 'news'.
Check if new+paper == newspaper, check if new+news == newspaper, etc., until you get to paper+news == newspaper, which returns.
Not sure if it is a good method, though, but it seems more efficient than checking the word letter by letter (more iterations), and you didn't explain how you'd determine where the second word starts.
I don't know what you mean by 'how do you generalize it for k'.
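A quick Python sketch of that idea (names are mine, and the in operator plays the role of strpos):

def two_word_split(word, dictionary):
    # keep only dictionary words that occur somewhere in `word`
    candidates = [w for w in dictionary if w in word]
    # then try to combine any two candidates into the full word
    for first in candidates:
        for second in candidates:
            if first + second == word:
                return first, second
    return None

For example, two_word_split("newspaper", {"new", "news", "paper", "spa"}) returns ("news", "paper").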

What is a typical algorithm for finding a string within a string?

I recently had an interview question that went something like this:
Given a large string (haystack), find a substring (needle)?
I was a little stumped to come up with a decent solution.
What is the best way to approach this that doesn't have a poor time complexity?
I like the Boyer-Moore algorithm. It's especially fun to implement when you have a lot of needles to find in a haystack (for example, probable-spam patterns in an email corpus).
You can use the Knuth-Morris-Pratt algorithm, which is O(n+m) where n is the length of the "haystack" string and m is the length of the search string.
The general problem is string searching; there are many algorithms and approaches depending on the nature of the application.
Knuth-Morris-Pratt, searches for a string within another string
Boyer-Moore, another string within string search
Aho-Corasick; searches for words from a reference dictionary in a given arbitrary text
Some advanced index data structures are also used in other applications. Suffix trees are used a lot in bioinformatics; here you have one long reference text, and then you have many arbitrary strings that you want to find all occurrences for. Once the index (i.e. the tree) is built, finding patterns can be done quite efficiently.
For an interview answer, I think it's better to show breadth as well. Knowing about all these different algorithms and what specific purposes they serve best is probably better than knowing just one algorithm by heart.
I do not believe that there is anything better than looking at the haystack one character at a time and trying to match the needle against it.
This will have linear time complexity (in the length of the haystack). It could degrade if the needle is nearly the same length as the haystack and shares a long, repeated common prefix with it.
A typical algorithm would be as follows, with string indexes ranging from 0 to length-1.
It returns either the position of the match or -1 if not found.

foreach hpos in range 0 thru len(haystack)-len(needle):
    found = true
    foreach npos in range 0 thru len(needle)-1:
        if haystack[hpos+npos] != needle[npos]:
            found = false
            break
    if found:
        return hpos
return -1
It has a reasonable performance since it only checks as many characters in each haystack position as needed to discover there's no chance of a match.
It's not necessarily the most efficient algorithm but if your interviewer expects you to know all the high performance algorithms off the top of your head, they're being unrealistic (i.e., fools). A good developer knows the basics well, and how to find out the advanced stuff where necessary (e.g., when there's a performance problem, not before).
The performance ranges between O(a) and O(ab) depending on the actual data in the strings, where a and b are the haystack and needle lengths respectively.
One possible improvement is to store, within the npos loop, the first location greater than hpos where the haystack character matches the first character of the needle (see the sketch below).
That way you can skip hpos forward in the next iteration, since you know there can be no possible match before that point. But again, that may not be necessary, depending on your performance requirements. You should work out the balance between speed and maintainability yourself.
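Here is a hedged Python sketch of that skip-ahead tweak (my own rendering of the idea, not the answerer's code):

def find_with_skip(haystack, needle):
    if not needle:
        return 0
    hpos = 0
    while hpos <= len(haystack) - len(needle):
        next_start = -1                  # first later position matching needle[0]
        matched = True
        for npos in range(len(needle)):
            ch = haystack[hpos + npos]
            if next_start == -1 and npos > 0 and ch == needle[0]:
                next_start = hpos + npos # remember it while we scan anyway
            if ch != needle[npos]:
                matched = False
                break
        if matched:
            return hpos
        # jump straight to the remembered candidate, or past the scanned region
        hpos = next_start if next_start != -1 else hpos + npos + 1
    return -1
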
This problem is discussed in Hacking a Google Interview Practice Questions – Person A. Their sample solution:
bool hasSubstring(const char *str, const char *find) {
    if (str[0] == '\0' && find[0] == '\0')
        return true;
    for(int i = 0; str[i] != '\0'; i++) {
        bool foundNonMatch = false;
        for(int j = 0; find[j] != '\0'; j++) {
            if (str[i + j] != find[j]) {
                foundNonMatch = true;
                break;
            }
        }
        if (!foundNonMatch)
            return true;
    }
    return false;
}
Here is a presentation of a few algorithms (shamelessly linked from Humboldt University). It contains a few good algorithms such as Boyer-Moore and Z-box.
I did use the Z-box algorithm and found it to work well; it was more efficient than Boyer-Moore, though it needs some time to wrap your head around.
The easiest way, I think, is "haystack".Contains("needle") ;)
Just for fun, don't take it seriously. I think you have already got your answer.

Fastest way to find minimal Hamming distance to any substring?

Given a long string L and a shorter string S (the constraint is that L.length must be >= S.length), I want to find the minimum Hamming distance between S and any substring of L with length equal to S.length. Let's call the function for this minHamming(). For example,
minHamming(ABCDEFGHIJ, CDEFGG) == 1.
minHamming(ABCDEFGHIJ, BCDGHI) == 3.
Doing this the obvious way (enumerating every substring of L) requires O(S.length * L.length) time. Is there any clever way to do this in sublinear time? I search the same L with several different S strings, so doing some complicated preprocessing to L once is acceptable.
Edit: The modified Boyer-Moore would be a good idea, except that my alphabet is only 4 letters (DNA).
Perhaps surprisingly, this exact problem can be solved in just O(|A| n log n) time using Fast Fourier Transforms (FFTs), where n is the length of the larger sequence L and |A| is the size of the alphabet.
Here is a freely available PDF of a paper by Donald Benson describing how it works:
Fourier methods for biosequence analysis (Donald Benson, Nucleic Acids Research 1990 vol. 18, pp. 3001-3006)
Summary: Convert each of your strings S and L into several indicator vectors (one per character, so 4 in the case of DNA), and then convolve corresponding vectors to determine match counts for each possible alignment. The trick is that convolution in the "time" domain, which ordinarily requires O(n^2) time, can be implemented using multiplication in the "frequency" domain, which requires just O(n) time, plus the time required to convert between domains and back again. Using the FFT each conversion takes just O(n log n) time, so the overall time complexity is O(|A| n log n). For greatest speed, finite field FFTs are used, which require only integer arithmetic.
Note: For arbitrary S and L this algorithm is clearly a huge performance win over the straightforward O(mn) algorithm as |S| and |L| become large, but OTOH if S is typically shorter than log|L| (e.g. when querying a large DB with a small sequence), then obviously this approach provides no speedup.
UPDATE 21/7/2009: Updated to mention that the time complexity also depends linearly on the size of the alphabet, since a separate pair of indicator vectors must be used for each character in the alphabet.
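As a rough sketch of how the indicator-vector trick maps to code (my own assumption of the mapping, in Python/numpy; np.correlate is used for clarity, and for large inputs you would swap in an FFT-based convolution such as scipy.signal.fftconvolve to get the O(n log n) behaviour described above):

import numpy as np

def min_hamming_by_correlation(L, S, alphabet="ACGT"):
    # one indicator vector per character; correlating them counts, for every
    # alignment i, how many positions of S match the window of L starting at i
    matches = np.zeros(len(L) - len(S) + 1)
    for ch in alphabet:
        l_ind = np.array([1.0 if c == ch else 0.0 for c in L])
        s_ind = np.array([1.0 if c == ch else 0.0 for c in S])
        matches += np.correlate(l_ind, s_ind, mode='valid')
    # Hamming distance = |S| minus the number of matching positions
    return int(len(S) - matches.max())
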
Modified Boyer-Moore
I've just dug up some old Python implementation of Boyer-Moore I had lying around and modified the matching loop (where the text is compared to the pattern). Instead of breaking out as soon as the first mismatch is found between the two strings, simply count up the number of mismatches, but remember the first mismatch:
current_dist = 0
while pattern_pos >= 0:
    if pattern[pattern_pos] != text[text_pos]:
        if first_mismatch == -1:
            first_mismatch = pattern_pos
            tp = text_pos
        current_dist += 1
        if current_dist == smallest_dist:
            break
    pattern_pos -= 1
    text_pos -= 1

smallest_dist = min(current_dist, smallest_dist)
# if the distance is 0, we've had a match and can quit
if current_dist == 0:
    return 0
else:  # shift
    pattern_pos = first_mismatch
    text_pos = tp
    ...
If the string did not match completely at this point, go back to the point of the first mismatch by restoring the values. This makes sure that the smallest distance is actually found.
The whole implementation is rather long (~150LOC), but I can post it on request. The core idea is outlined above, everything else is standard Boyer-Moore.
Preprocessing on the Text
Another way to speed things up is preprocessing the text to have an index on character positions. You only want to start comparing at positions where at least a single match between the two strings occurs, otherwise the Hamming distance is |S| trivially.
import sys
from collections import defaultdict
import bisect

def char_positions(t):
    pos = defaultdict(list)
    for idx, c in enumerate(t):
        pos[c].append(idx)
    return dict(pos)
This method simply creates a dictionary which maps each character in the text to the sorted list of its occurrences.
The comparison loop is more or less unchanged compared to the naive O(mn) approach, apart from the fact that we do not increase the position at which comparison is started by 1 each time, but based on the character positions:
def min_hamming(text, pattern):
    best = len(pattern)
    pos = char_positions(text)
    i = find_next_pos(pattern, pos, 0)
    while i <= len(text) - len(pattern):
        dist = 0
        for c in range(len(pattern)):
            if text[i+c] != pattern[c]:
                dist += 1
                if dist == best:
                    break
        else:  # inner loop finished without an early break
            if dist == 0:
                return 0
            best = min(dist, best)
        i = find_next_pos(pattern, pos, i + 1)
    return best
The actual improvement is in find_next_pos:
def find_next_pos(pattern, pos, i):
    smallest = sys.maxint
    for idx, c in enumerate(pattern):
        if c in pos:
            x = bisect.bisect_left(pos[c], i + idx)
            if x < len(pos[c]):
                smallest = min(smallest, pos[c][x] - idx)
    return smallest
For each new position, we find the lowest index at which a character from S occurs in L. If there is no such index any more, the algorithm will terminate.
find_next_pos is certainly complex, and one could try to improve it by only using the first several characters of the pattern S, or use a set to make sure characters from the pattern are not checked twice.
Discussion
Which method is faster largely depends on your dataset. The more diverse your alphabet is, the larger will be the jumps. If you have a very long L, the second method with preprocessing might be faster. For very, very short strings (like in your question), the naive approach will certainly be the fastest.
DNA
If you have a very small alphabet, you could try to get the character positions for character bigrams (or larger) rather than unigrams.
You're stuck as far as big-O is concerned. At a fundamental level, you're going to need to test whether every letter in the target matches each eligible letter in the substring.
Luckily, this is easily parallelized.
One optimization you can apply is to keep a running count of mismatches for the current position. If it's greater than the lowest hamming distance so far, then obviously you can skip to the next possibility.

In-Place Radix Sort

This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?
Preliminary
I've got a huge number of small fixed-length strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: DNA) that I want to sort.
At the moment, I use std::sort which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work much better in practice.
Details
I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (N > 5,000,000).
The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.
Use Cases
Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.
Research
Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.
In particular, there are the following things.
First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions dest_group and dest_address without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least dest_address isn’t trivial).
Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).
Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear which is flat out wrong.
To recap: Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?
Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires 2 * seq.length passes through the array.
void radixSort(string[] seqs, size_t base = 0) {
    if(seqs.length == 0)
        return;

    size_t TPos = seqs.length, APos = 0;
    size_t i = 0;
    while(i < TPos) {
        if(seqs[i][base] == 'A') {
            swap(seqs[i], seqs[APos++]);
            i++;
        }
        else if(seqs[i][base] == 'T') {
            swap(seqs[i], seqs[--TPos]);
        } else i++;
    }

    i = APos;
    size_t CPos = APos;
    while(i < TPos) {
        if(seqs[i][base] == 'C') {
            swap(seqs[i], seqs[CPos++]);
        }
        i++;
    }
    if(base < seqs[0].length - 1) {
        radixSort(seqs[0..APos], base + 1);
        radixSort(seqs[APos..CPos], base + 1);
        radixSort(seqs[CPos..TPos], base + 1);
        radixSort(seqs[TPos..seqs.length], base + 1);
    }
}
Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.
Edit:
I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above now is actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.
I've never seen an in-place radix sort, and from the nature of radix sort I doubt that it is much faster than an out-of-place sort, as long as the temporary array fits into memory.
Reason:
The sorting does a linear read on the input array, but all writes will be nearly random. From a certain N upwards this boils down to a cache miss per write. This cache miss is what slows down your algorithm. If it's in place or not will not change this effect.
I know that this will not answer your question directly, but if sorting is a bottleneck you may want to have a look at near sorting algorithms as a preprocessing step (the wiki-page on the soft-heap may get you started).
That could give a very nice cache locality boost. A text-book out-of-place radix sort will then perform better. The writes will still be nearly random but at least they will cluster around the same chunks of memory and as such increase the cache hit ratio.
I have no idea if it works out in practice though.
Btw: If you're dealing with DNA strings only: You can compress a char into two bits and pack your data quite a lot. This will cut down the memory requirement by a factor of four over a naive representation. Addressing becomes more complex, but the ALU of your CPU has lots of time to spend during all the cache misses anyway.
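For instance, a tiny Python sketch of that two-bits-per-base packing (my own illustration):

CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def pack(seq):
    # pack a DNA string into an integer, two bits per base
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value

pack('ACGT') == 0b00011011, so a 16-character sequence fits in a 32-bit integer.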
You can certainly drop the memory requirements by encoding the sequence in bits.
You are looking at all possible sequences, so for length 2, with "ACGT", that's 16 states, or 4 bits.
For length 3, that's 64 states, which can be encoded in 6 bits. So it looks like 2 bits for each letter in the sequence, or about 32 bits for 16 characters like you said.
If there is a way to reduce the number of valid 'words', further compression may be possible.
So for sequences of length 3, one could create 64 buckets, maybe sized uint32, or uint64.
Initialize them to zero.
Iterate through your very very large list of 3 char sequences, and encode them as above.
Use this as a subscript, and increment that bucket.
Repeat this until all of your sequences have been processed.
Next, regenerate your list.
Iterate through the 64 buckets in order; for the count found in each bucket, generate that many instances of the sequence represented by that bucket.
When all of the buckets have been iterated, you have your sorted array.
A sequence of 4, adds 2 bits, so there would be 256 buckets.
A sequence of 5, adds 2 bits, so there would be 1024 buckets.
At some point the number of buckets will approach your limits.
If you read the sequences from a file, instead of keeping them in memory, more memory would be available for buckets.
I think this would be faster than doing the sort in situ as the buckets are likely to fit within your working set.
Here is a hack that shows the technique
#include <iostream>
#include <iomanip>
#include <math.h>
#include <string.h>   // strlen, memset
#include <stdlib.h>   // abort

using namespace std;

const int width = 3;
const int bucketCount = exp(width * log(4)) + 1;
int *bucket = NULL;

const char charMap[4] = {'A', 'C', 'G', 'T'};

void setup(void)
{
    bucket = new int[bucketCount];
    memset(bucket, '\0', bucketCount * sizeof(bucket[0]));
}

void teardown(void)
{
    delete[] bucket;
}

void show(int encoded)
{
    int z;
    int y;
    int j;
    for (z = width - 1; z >= 0; z--)
    {
        int n = 1;
        for (y = 0; y < z; y++)
            n *= 4;

        j = encoded % n;
        encoded -= j;
        encoded /= n;
        cout << charMap[encoded];
        encoded = j;
    }
    cout << endl;
}

int main(void)
{
    // Sort this sequence
    const char *testSequence = "CAGCCCAAAGGGTTTAGACTTGGTGCGCAGCAGTTAAGATTGTTT";
    size_t testSequenceLength = strlen(testSequence);

    setup();

    // load the sequences into the buckets
    size_t z;
    for (z = 0; z < testSequenceLength; z += width)
    {
        int encoding = 0;
        size_t y;
        for (y = 0; y < width; y++)
        {
            encoding *= 4;
            switch (*(testSequence + z + y))
            {
                case 'A' : encoding += 0; break;
                case 'C' : encoding += 1; break;
                case 'G' : encoding += 2; break;
                case 'T' : encoding += 3; break;
                default : abort();
            };
        }
        bucket[encoding]++;
    }

    /* show the sorted sequences */
    for (z = 0; z < bucketCount; z++)
    {
        while (bucket[z] > 0)
        {
            show(z);
            bucket[z]--;
        }
    }

    teardown();
    return 0;
}
If your data set is so big, then I would think that a disk-based buffer approach would be best:
sort(List<string> elements, int prefix)
    if (elements.Count < THRESHOLD)
        return InMemoryRadixSort(elements, prefix)
    else
        return DiskBackedRadixSort(elements, prefix)

DiskBackedRadixSort(elements, prefix)
    DiskBackedBuffer<string>[] buckets
    foreach (element in elements)
        buckets[element.MSB(prefix)].Add(element);

    List<string> ret
    foreach (bucket in buckets)
        ret.Add(sort(bucket, prefix + 1))

    return ret
I would also experiment with grouping into a larger number of buckets; for instance, if your string was:
GATTACA
the first MSB call would return the bucket for GATT (256 total buckets); that way you make fewer branches of the disk-based buffer. This may or may not improve performance, so experiment with it.
I'm going to go out on a limb and suggest you switch to a heap/heapsort implementation. This suggestion comes with some assumptions:
You control the reading of the data
You can do something meaningful with the sorted data as soon as you 'start' getting it sorted.
The beauty of the heap/heap-sort is that you can build the heap while you read the data, and you can start getting results the moment you have built the heap.
Let's step back. If you are so fortunate that you can read the data asynchronously (that is, you can post some kind of read request and be notified when some data is ready), then you can build a chunk of the heap while you are waiting for the next chunk of data to come in, even from disk. Often, this approach can bury most of the cost of half of your sorting behind the time spent getting the data.
Once you have the data read, the first element is already available. Depending on where you are sending the data, this can be great. If you are sending it to another asynchronous reader, some parallel 'event' model, or a UI, you can send it chunk after chunk as you go.
That said - if you have no control over how the data is read, and it is read synchronously, and you have no use for the sorted data until it is entirely written out - ignore all this. :(
See the Wikipedia articles:
Heapsort
Binary heap
"Radix sorting with no extra space" is a paper addressing your problem.
Performance-wise you might want to look at more general string-comparison sorting algorithms.
Currently you wind up touching every element of every string, but you can do better!
In particular, a burst sort is a very good fit for this case. As a bonus, since burstsort is based on tries, it works ridiculously well for the small alphabet sizes used in DNA/RNA, since you don't need to build any sort of ternary search node, hash or other trie node compression scheme into the trie implementation. The tries may be useful for your suffix-array-like final goal as well.
A decent general purpose implementation of burstsort is available on source forge at http://sourceforge.net/projects/burstsort/ - but it is not in-place.
For comparison purposes, the C-burstsort implementation covered at http://www.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf benchmarks 4-5x faster than quicksort and radix sorts for some typical workloads.
You'll want to take a look at Large-scale Genome Sequence Processing by Drs. Kasahara and Morishita.
Strings comprised of the four nucleotide letters A, C, G, and T can be specially encoded into Integers for much faster processing. Radix sort is among many algorithms discussed in the book; you should be able to adapt the accepted answer to this question and see a big performance improvement.
You might try using a trie. Sorting the data is simply iterating through the dataset and inserting it; the structure is naturally sorted, and you can think of it as similar to a B-Tree (except instead of making comparisons, you always use pointer indirections).
Caching behavior will favor all of the internal nodes, so you probably won't improve upon that; but you can fiddle with the branching factor of your trie as well (ensure that every node fits into a single cache line, allocate trie nodes similar to a heap, as a contiguous array that represents a level-order traversal). Since tries are also digital structures (O(k) insert/find/delete for elements of length k), you should have competitive performance to a radix sort.
I would burstsort a packed-bit representation of the strings. Burstsort is claimed to have much better locality than radix sorts, keeping the extra space usage down with burst tries in place of classical tries. The original paper has measurements.
It looks like you've solved the problem, but for the record, it appears that one version of a workable in-place radix sort is the "American Flag Sort". It's described here: Engineering Radix Sort. The general idea is to do 2 passes on each character - first count how many of each you have, so you can subdivide the input array into bins. Then go through again, swapping each element into the correct bin. Now recursively sort each bin on the next character position.
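As a rough Python sketch of that two-pass idea for equal-length DNA strings (my own code, an illustration rather than a reference implementation):

def american_flag_sort(seqs, pos=0, lo=0, hi=None):
    # in-place MSD radix sort ("American flag sort") on equal-length DNA strings
    if hi is None:
        hi = len(seqs)
    if hi - lo <= 1 or pos >= len(seqs[lo]):
        return
    order = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    # pass 1: count how many strings fall into each of the 4 bins at this position
    counts = [0, 0, 0, 0]
    for i in range(lo, hi):
        counts[order[seqs[i][pos]]] += 1
    starts = [lo, lo + counts[0], lo + counts[0] + counts[1],
              lo + counts[0] + counts[1] + counts[2]]
    ends = [starts[b] + counts[b] for b in range(4)]
    next_free = list(starts)
    # pass 2: swap every element into its bin
    for b in range(4):
        while next_free[b] < ends[b]:
            dest = order[seqs[next_free[b]][pos]]
            if dest == b:
                next_free[b] += 1
            else:
                j = next_free[dest]
                seqs[next_free[b]], seqs[j] = seqs[j], seqs[next_free[b]]
                next_free[dest] += 1
    # recurse into each bin on the next character position
    for b in range(4):
        american_flag_sort(seqs, pos + 1, starts[b], ends[b])
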
Radix sort is not cache-conscious, and it is not the fastest sort algorithm for large sets.
You can look at:
ti7qsort. ti7qsort is the fastest sort for integers (can be used for small-fixed size strings).
Inline QSORT
String sorting
You can also use compression and encode each letter of your DNA into 2 bits before storing into the sort array.
dsimcha's MSB radix sort looks nice, but Nils gets closer to the heart of the problem with the observation that cache locality is what's killing you at large problem sizes.
I suggest a very simple approach:
Empirically estimate the largest size m for which a radix sort is efficient.
Read blocks of m elements at a time, radix sort them, and write them out (to a memory buffer if you have enough memory, but otherwise to file), until you exhaust your input.
Mergesort the resulting sorted blocks.
Mergesort is the most cache-friendly sorting algorithm I'm aware of: "Read the next item from either array A or B, then write an item to the output buffer." It runs efficiently on tape drives. It does require 2n space to sort n items, but my bet is that the much-improved cache locality you'll see will make that unimportant -- and if you were using a non-in-place radix sort, you needed that extra space anyway.
Please note finally that mergesort can be implemented without recursion, and in fact doing it this way makes clear the true linear memory access pattern.
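A minimal Python sketch of that block-sort-then-merge plan (list.sort stands in for the cache-friendly radix sort, and the sorted runs are kept in memory here; a real version would spill runs to disk):

import heapq

def blockwise_sort(seqs, block_size):
    # sort fixed-size blocks independently, then k-way merge the sorted runs
    runs = []
    for start in range(0, len(seqs), block_size):
        block = seqs[start:start + block_size]
        block.sort()                  # stand-in for the in-memory radix sort
        runs.append(block)
    return list(heapq.merge(*runs))   # streaming merge of the sorted runs
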
First, think about the coding of your problem. Get rid of the strings, replace them by a binary representation. Use the first byte to indicate length+encoding. Alternatively, use a fixed length representation at a four-byte boundary. Then the radix sort becomes much easier. For a radix sort, the most important thing is to not have exception handling at the hot spot of the inner loop.
OK, I thought a bit more about the 4-nary problem. You want a solution like a Judy tree for this. The next solution can handle variable length strings; for fixed length just remove the length bits, that actually makes it easier.
Allocate blocks of 16 pointers. The least significant bit of the pointers can be reused, as your blocks will always be aligned. You might want a special storage allocator for it (breaking up large storage into smaller blocks). There are a number of different kinds of blocks:
Encoding with 7 length bits of variable-length strings. As they fill up, you replace them by:
Position encodes the next two characters, you have 16 pointers to the next blocks, ending with:
Bitmap encoding of the last three characters of a string.
For each kind of block, you need to store different information in the LSBs. As you have variable length strings you need to store end-of-string too, and the last kind of block can only be used for the longest strings. The 7 length bits should be replaced by less as you get deeper into the structure.
This provides you with a reasonably fast and very memory efficient storage of sorted strings. It will behave somewhat like a trie. To get this working, make sure to build enough unit tests. You want coverage of all block transitions. You want to start with only the second kind of block.
For even more performance, you might want to add different block types and a larger size of block. If the blocks are always the same size and large enough, you can use even fewer bits for the pointers. With a block size of 16 pointers, you already have a byte free in a 32-bit address space. Take a look at the Judy tree documentation for interesting block types. Basically, you add code and engineering time for a space (and runtime) trade-off
You probably want to start with a 256 wide direct radix for the first four characters. That provides a decent space/time tradeoff. In this implementation, you get much less memory overhead than with a simple trie; it is approximately three times smaller (I haven't measured). O(n) is no problem if the constant is low enough, as you noticed when comparing with the O(n log n) quicksort.
Are you interested in handling duplicates? With short sequences, there are going to be some. Adapting the blocks to handle counts is tricky, but it can be very space-efficient.
While the accepted answer perfectly answers the description of the problem, I've reached this place looking in vain for an algorithm to partition an array in place into N parts. I've written one myself, so here it is.
Warning: this is not a stable partitioning algorithm, so for multilevel partitioning, one must repartition each resulting partition instead of the whole array. The advantage is that it is in place.
The way it helps with the question posed is that you can repeatedly partition inline based on a letter of the string, then sort the partitions when they are small enough with the algorithm of your choice.
function partitionInPlace(input, partitionFunction, numPartitions, startIndex=0, endIndex=-1) {
    if (endIndex === -1) endIndex = input.length;
    const starts = Array.from({ length: numPartitions + 1 }, () => 0);
    for (let i = startIndex; i < endIndex; i++) {
        const val = input[i];
        const partByte = partitionFunction(val);
        starts[partByte]++;
    }
    let prev = startIndex;
    for (let i = 0; i < numPartitions; i++) {
        const p = prev;
        prev += starts[i];
        starts[i] = p;
    }
    const indexes = [...starts];
    starts[numPartitions] = prev;

    let bucket = 0;
    while (bucket < numPartitions) {
        const start = starts[bucket];
        const end = starts[bucket + 1];
        if (end - start < 1) {
            bucket++;
            continue;
        }
        let index = indexes[bucket];
        if (index === end) {
            bucket++;
            continue;
        }
        let val = input[index];
        let destBucket = partitionFunction(val);
        if (destBucket === bucket) {
            indexes[bucket] = index + 1;
            continue;
        }
        let dest;
        do {
            dest = indexes[destBucket] - 1;
            let destVal;
            let destValBucket = destBucket;
            while (destValBucket === destBucket) {
                dest++;
                destVal = input[dest];
                destValBucket = partitionFunction(destVal);
            }
            input[dest] = val;
            indexes[destBucket] = dest + 1;
            val = destVal;
            destBucket = destValBucket;
        } while (dest !== index)
    }
    return starts;
}
