I have a large set (100k) of short strings (not more than 100 chars) and I need to quickly find all those that contain a certain substring.
This will be used as a search box where the user starts typing and the system immediately gives "suggestions" (the strings that have as a substring the text that the user typed). Something similar to the "Tag" box in StackOverflow.
As this will be interactive, it should be pretty fast. What algorithm or data structure do you recommend for this?
BTW, I'll be using Delphi 2007.
Thanks in advance.
I wrote out a long blurb, doing a bunch of complexity calculations and xzibit jokes (tree in a tree so you can lookup when you lookup), but then realized that this is easier than I thought. Browsers do this all the time and they never precompute big tables every time you're loading a page.
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
What it means is that you take your 100k strings and combine them into one long string. You then take your query substring and scan over the big string looking for matches. But you're not advancing by a single character each time (which would mean looking at roughly 100k * 100 positions); Boyer-Moore lets you skip ahead by up to the length of your substring on a mismatch, so the longer your substring gets, the faster this goes.
here's a great example: http://userweb.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html
they're searching for the string EXAMPLE.
This is the kind of stuff browsers and text editors do, and they don't really build giant prefix tables every time you load a page.
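If you want to roll it yourself, here's a rough sketch of the Horspool variant (the simplified Boyer-Moore that only keeps the bad-character skip table). This is my own illustration, not code from the link:

#include <string>
#include <vector>
#include <cstddef>

// Boyer-Moore-Horspool: precompute, for each byte, how far the pattern can be
// shifted when the character aligned with the pattern's last position mismatches.
std::size_t horspool_find(const std::string &text, const std::string &pat) {
    const std::size_t n = text.size(), m = pat.size();
    if (m == 0 || m > n) return std::string::npos;

    std::vector<std::size_t> shift(256, m);           // default: shift by whole pattern
    for (std::size_t i = 0; i + 1 < m; ++i)
        shift[(unsigned char)pat[i]] = m - 1 - i;     // distance from char to pattern end

    std::size_t pos = 0;
    while (pos + m <= n) {
        std::size_t j = m;
        while (j > 0 && text[pos + j - 1] == pat[j - 1]) --j;   // compare right to left
        if (j == 0) return pos;                                  // full match
        pos += shift[(unsigned char)text[pos + m - 1]];          // skip ahead
    }
    return std::string::npos;
}

To use it for the suggestion box, you'd concatenate the 100k strings with a separator byte that can't occur in them and map each hit offset back to the string it falls inside.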
The data structure you'll likely want to use is a Trie, specifically a suffix trie. Read this article for a good explanation of how they work and how to write one for your problem.
While you certainly can speed things up with a better data structure, this is a time that brute force might be perfectly adequate. Doing a quick test:
[Edit: changed code to search for substrings, edited again to shorten the substring it searches for compared to the ones it searches in.]
#include <algorithm>
#include <iostream>
#include <vector>
#include <string>
#include <stdlib.h>   // rand
#include <time.h>

std::string rand_string(int min=20, int max=100) {
    size_t length = rand() % (max-min) + min;
    std::string ret;
    for (size_t i=0; i<length; i++)
        ret.push_back(rand() % ('z' - 'a') + 'a');
    return ret;
}

class substr {
    std::string seek;
public:
    substr(std::string x) : seek(x) {}
    bool operator()(std::string const &y) { return y.find(seek) != std::string::npos; }
};

int main() {
    std::vector<std::string> values;
    for (int i=0; i<100000; i++)
        values.push_back(rand_string());

    std::string seek = rand_string(5, 10);
    const int reps = 10;

    clock_t start = clock();
    std::vector<std::string>::iterator pos;
    for (int i=0; i<reps; i++)
        pos = std::find_if(values.begin(), values.end(), substr(seek));
    clock_t stop = clock();

    std::cout << "Search took: " << double(stop-start)/CLOCKS_PER_SEC/reps << " seconds\n";
    if (pos == values.end())
        std::cout << "Value wasn't found\n";
    else
        std::cout << "Value was found\n";
    return 0;
}
On my machine (around 4 years old -- hardly a speed demon by current standards) this runs in around 3 to 10 milliseconds per search. That's fast enough to appear instantaneous to an interactive user -- and with 10 times as many strings, it would still be fine.
I hate to disagree with Mike and his supporters, but suffix trees (data structure described in his link) are a lot of pain to implement. And finding reliable implementation in Pascal/Delphi might be difficult too.
Suffix arrays offer the same functionality while being a lot simpler. The tradeoff is O(m log n) search complexity, where m is the length of the search token and n is the size of the indexed text (here, the concatenation of the 100k strings, up to ~10 MB).
In case somebody doesn't know, both suffix trees and suffix arrays allow you to find all occurrences of substring s in long text t.
Fernando's problem can be reduced to this one by concatenating the initial set of strings into one string using some delimiter symbol. For example, if the initial set is ["text1", "text2", "some text"], then the resulting string t will be "text1|text2|some text".
Now, instead of searching for string "text" in each word separately, we just need to find all occurrences of it in big string t.
I also recommend Oren's answer where he suggests another realistic approach.
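To make the suffix-array idea concrete, here's a minimal sketch (my own illustration, not anything Fernando has to use): build the array with a plain comparison sort, which is plenty fast as a one-off preprocessing step for ~10 MB of text, then binary-search for the query as a prefix of the sorted suffixes.

#include <algorithm>
#include <string>
#include <vector>
#include <cstddef>

// Build a suffix array naively: indices of all suffixes, sorted lexicographically.
// Fine for a one-off index over a few megabytes; use a proper SA construction
// algorithm (e.g. SA-IS) for bigger data.
std::vector<std::size_t> build_suffix_array(const std::string &t) {
    std::vector<std::size_t> sa(t.size());
    for (std::size_t i = 0; i < t.size(); ++i) sa[i] = i;
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
    });
    return sa;
}

// Return the range [lo, hi) of suffixes that start with q; each one is an occurrence of q in t.
std::pair<std::size_t, std::size_t>
find_occurrences(const std::string &t, const std::vector<std::size_t> &sa, const std::string &q) {
    auto cmp = [&](std::size_t suffix, const std::string &key) {
        return t.compare(suffix, key.size(), key) < 0;
    };
    auto lo = std::lower_bound(sa.begin(), sa.end(), q, cmp);
    auto hi = lo;
    while (hi != sa.end() && t.compare(*hi, q.size(), q) == 0) ++hi;   // collect all hits
    return { std::size_t(lo - sa.begin()), std::size_t(hi - sa.begin()) };
}

Each suffix position in the returned range is an occurrence of q in t, and can be mapped back to one of the original strings by remembering where each string starts in the concatenation.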
This Delphi implementation of Boyer-Moore-Horspool might give you what you need.
Disclaimer: I did not try this code...
What you're probably looking for is n-grams. They're used to find the most likely words related to your substring. Very interesting stuff, and though it may be overkill for what you're looking for, it's still good to know.
I have a list of about 200 integers whose values are between 1 and 5.
I want to get into learning about sorting algorithms and where to apply each, because at the moment I use bubble sort for everything, which I've been told is a terrible way to do things.
What would be the fastest sorting algorithm for this integer sorting?
EDIT: It turns out that because I know the numbers are 1 to 5 then I can use a bucket sort (?) algorithm which if I'm not mistaken - and I definitely could be - means that for each integer of value 1, I put it in the 1 group, value 2 I put it in the 2 group etc, then concatenate the groups at the end. This seems like a simple and efficient way to do it.
However, since this is (currently) a learning exercise for me, I am going to remove the 1 - 5 limitation and try to implement bubble-sort and merge-sort, then compare the two to see which is faster.
Thanks for your help!
... which I've been told is a terrible way to do things.
First off, don't accept as gospel anything you hear from random bods on the internet (even me).
Bubble sort is fine under certain conditions, such as when the data is already mostly sorted, or the item count is relatively small (such as 200) (a), or you have no sort functionality built into the language and you're on a tight deadline where lack of performance will annoy the customer but lack of functionality will get you fired :-)
This bias against bubble sort is similar to the "only one exit point from a function" and "no goto" rules. You should understand the reasoning behind them so that you know when the rules can be ignored safely.
Anyway, on to the question proper. An efficient way for your specific case is to just count the items then output them, something like:
dim count[1..5] = {0, 0, 0, 0, 0};
for each item in list:
    count[item] = count[item] + 1
for val in 1..5:
    for quant in 1..count[val]:
        output val
That's an O(n) time and O(1) space solution, and you won't find a better big-O - a generalised comparison sort can't beat O(n log n). It's only possible here because of the extra information you have about the data (the values are limited to 1 through 5).
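For completeness, the same counting idea as a quick, runnable sketch (just an illustration, not tuned code):

#include <array>
#include <vector>

// Counting sort for values known to be in 1..5: O(n) time, O(1) extra space.
std::vector<int> sort_one_to_five(const std::vector<int> &items) {
    std::array<int, 6> count{};              // count[1]..count[5] used
    for (int v : items) ++count[v];
    std::vector<int> out;
    out.reserve(items.size());
    for (int v = 1; v <= 5; ++v)
        out.insert(out.end(), count[v], v);  // emit each value count[v] times
    return out;
}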
If you wanted to examine all the different sort algorithms, the Wikipedia Sorting Algorithm page is a useful starting point, including the major algorithms and their properties.
(a) As an aside, the following code (using worst case data for bubble sort), when run under CygWin on a not-very-powerful IBM T60 (2GHz dual core) laptop, completes in, on average, 0.157 seconds (5 samples: 0.150, 0.125, 0.192, 0.199, 0.115).
I wouldn't use it for sorting a million items (everyone knows bubble sort scales poorly) but 200 should be fine in most cases:
#include <stdio.h>

#define COUNT 200

int main (void) {
    int i, swapped, tmp, item[COUNT];

    // Set up worst case (reverse order) data.
    for (i = 0; i < COUNT; i++)
        item[i] = 200 - i;

    // Slightly optimised bubble sort.
    swapped = 1;
    while (swapped) {
        swapped = 0;
        for (i = 1; i < COUNT; i++) {
            if (item[i-1] > item[i]) {
                tmp = item[i-1];
                item[i-1] = item[i];
                item[i] = tmp;
                swapped = 1;
            }
        }
    }

    // for (i = 0; i < COUNT; i++)
    //     printf ("%d ", item[i]);
    // putchar ('\n');

    return 0;
}
You may not need sorting here, since you only have 5 possible values.
You could use 5 containers (or buckets) and as you scan your list of integers you place the values in the right bucket.
At the end, join the buckets together, in order.
Merge sort is O(n log n); I think it's way better than quicksort.
You can find some C# code here.
I recently had an interview question that went something like this:
Given a large string (the haystack), how would you find a given substring (the needle) in it?
I was a little stumped to come up with a decent solution.
What is the best way to approach this that doesn't have a poor time complexity?
I like the Boyer-Moore algorithm. It's especially fun to implement when you have a lot of needles to find in a haystack (for example, probable-spam patterns in an email corpus).
You can use the Knuth-Morris-Pratt algorithm, which is O(n+m) where n is the length of the "haystack" string and m is the length of the search string.
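A minimal KMP sketch, in case it helps to see the failure table spelled out (my own illustration, not tuned code):

#include <string>
#include <vector>

// Knuth-Morris-Pratt: O(n + m). fail[j] is the length of the longest proper
// prefix of needle[0..j] that is also a suffix of it.
long kmp_find(const std::string &haystack, const std::string &needle) {
    if (needle.empty()) return 0;

    std::vector<std::size_t> fail(needle.size(), 0);
    for (std::size_t i = 1, k = 0; i < needle.size(); ++i) {
        while (k > 0 && needle[i] != needle[k]) k = fail[k - 1];
        if (needle[i] == needle[k]) ++k;
        fail[i] = k;
    }

    for (std::size_t i = 0, k = 0; i < haystack.size(); ++i) {
        while (k > 0 && haystack[i] != needle[k]) k = fail[k - 1];
        if (haystack[i] == needle[k]) ++k;
        if (k == needle.size()) return long(i + 1 - k);   // match starts here
    }
    return -1;   // not found
}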
The general problem is string searching; there are many algorithms and approaches depending on the nature of the application.
Knuth-Morris-Pratt, searches for a string within another string
Boyer-Moore, another string within string search
Aho-Corasick; searches for words from a reference dictionary in a given arbitrary text
Some advanced index data structures are also used in other applications. Suffix trees are used a lot in bioinformatics; here you have one long reference text, and then you have many arbitrary strings that you want to find all occurrences for. Once the index (i.e. the tree) is built, finding patterns can be done quite efficiently.
For an interview answer, I think it's better to show breadth as well. Knowing about all these different algorithms and what specific purposes they serve best is probably better than knowing just one algorithm by heart.
I do not believe that there is anything better than looking at the haystack one character at a time and trying to match it against the needle.
This has linear time complexity (in the length of the haystack) in the typical case. It can degrade if the needle is nearly the same length as the haystack and the two share a long, repeated prefix.
A typical algorithm would be as follows, with string indexes ranging from 0 to length-1.
It returns either the position of the match or -1 if not found.
foreach hpos in range 0 thru len(haystack)-len(needle):
    found = true
    foreach npos in range 0 thru len(needle)-1:
        if haystack[hpos+npos] != needle[npos]:
            found = false
            break
    if found:
        return hpos
return -1
It has a reasonable performance since it only checks as many characters in each haystack position as needed to discover there's no chance of a match.
It's not necessarily the most efficient algorithm but if your interviewer expects you to know all the high performance algorithms off the top of your head, they're being unrealistic (i.e., fools). A good developer knows the basics well, and how to find out the advanced stuff where necessary (e.g., when there's a performance problem, not before).
The performance ranges between O(a) and O(ab) depending on the actual data in the strings, where a and b are the haystack and needle lengths respectively.
One possible improvement is to record, within the npos loop, the first location greater than hpos where the character matches the first character of the needle.
That way you can skip hpos forward to that location in the next iteration, since you know there can be no match before it. But again, that may not be necessary, depending on your performance requirements. You should work out the balance between speed and maintainability yourself.
This problem is discussed in Hacking a Google Interview Practice Questions – Person A. Their sample solution:
bool hasSubstring(const char *str, const char *find) {
    if (str[0] == '\0' && find[0] == '\0')
        return true;
    for (int i = 0; str[i] != '\0'; i++) {
        bool foundNonMatch = false;
        for (int j = 0; find[j] != '\0'; j++) {
            if (str[i + j] != find[j]) {
                foundNonMatch = true;
                break;
            }
        }
        if (!foundNonMatch)
            return true;
    }
    return false;
}
Here is a presentation of a few algorithms (shamelessly linked from Humboldt University). It covers a few good ones such as Boyer-Moore and the Z-box algorithm.
I used the Z-box algorithm, found it to work well, and found it more efficient than Boyer-Moore; it does take some time to wrap your head around, though.
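For anyone curious, here's a rough sketch of the Z-box idea: compute, for each position, the length of the longest substring starting there that matches a prefix of the whole string; to search, concatenate needle + separator + haystack and look for Z values equal to the needle length. This is my own illustration, not the code from the slides, and it assumes the separator byte never occurs in the input:

#include <algorithm>
#include <string>
#include <vector>

// Z-array: z[i] = length of the longest common prefix of s and s.substr(i).
std::vector<std::size_t> z_array(const std::string &s) {
    std::vector<std::size_t> z(s.size(), 0);
    std::size_t l = 0, r = 0;                             // rightmost Z-box [l, r)
    for (std::size_t i = 1; i < s.size(); ++i) {
        if (i < r) z[i] = std::min(r - i, z[i - l]);      // reuse what the box tells us
        while (i + z[i] < s.size() && s[z[i]] == s[i + z[i]]) ++z[i];
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return z;
}

// Find the first occurrence of needle in haystack via the Z-array.
long z_find(const std::string &haystack, const std::string &needle) {
    std::string s = needle + '\x01' + haystack;           // '\x01' assumed absent from input
    std::vector<std::size_t> z = z_array(s);
    for (std::size_t i = needle.size() + 1; i < s.size(); ++i)
        if (z[i] >= needle.size())
            return long(i - needle.size() - 1);
    return -1;
}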
The easiest way, I think, is "haystack".Contains("needle") ;)
Just for fun, don't take it seriously. I think you've already got your answer.
How can I generate all perfect numbers between 1 and 100?
A perfect number is a positive integer that is equal to the sum of its proper divisors. For example, 6 (= 1 + 2 + 3) is a perfect number.
So I suspect Frank is looking for an answer in Prolog, and yes it does smell rather homeworky...
For fun I decided to write up my answer. It took me about 50 lines.
So here is the outline of what my predicates look like. Maybe it will help get you thinking the Prolog way.
is_divisor(+Num,+Factor)
divisors(+Num,-Factors)
divisors(+Num,+N,-Factors)
sum(+List,-Total)
sum(+List,+Sofar,-Total)
is_perfect(+N)
perfect(+N,-List)
The + and - are not really part of the parameter names. They are a documentation clue about what the author expects to be instantiated (NB). "+Foo" means you expect Foo to have a value when the predicate is called. "-Foo" means you expect Foo to be a variable when the predicate is called, and that it will be given a value by the time it finishes. (Kind of like input and output, if it helps to think of it that way.)
Whenever you see a pair of predicates like sum/2 and sum/3, odds are the sum/2 one is like a wrapper to the sum/3 one which is doing something like an accumulator.
I didn't bother to make it print them out nicely. You can just query it directly in the Prolog command line:
?- perfect(100,L).
L = [28, 6] ;
fail.
Another thing that might be helpful, that I find with Prolog predicates, is that there are generally two kinds. One is one that simply checks if something is true. For this kind of predicate, you want everything else to fail. These don't tend to need to be recursive.
Others will want to go through a range (of numbers or a list) and always return a result, even if it is 0 or []. For these types of predicates you need to use recursion and think about your base case.
HTH.
NB: This is called "mode", and you can actually specify modes and the compiler/interpreter will enforce them, but I personally just use them as documentation. I also tried to find a page with info about Prolog modes, but I couldn't find a good link. :(
I'm not sure if this is what you were looking for, but you could always just print out "6, 28"...
Well, it looks like you need to loop up to n/2, that is, half of n. Divide the number by each candidate, and if there is no remainder, add the candidate to the total. Once you have exhausted all candidates up to n/2, check whether the total equals the number you are testing.
For instance:
#include "stdafx.h"
#include "iostream"
#include "math.h"
using namespace std;
int main(void)
{
int total=0;
for(int i = 1; i<=100; i++)
{
for( int j=1; j<=i/2; j++)
{
if (!(i%j))
{
total+=j;
}
}
if (i==total)
{
cout << i << " is perfect";
}
//it works
total=0;
}
return 0;
}
Given a query string Q of length N, and a list L of M sequences of length exactly N, what is the most efficient algorithm to find the string in L with the fewest mismatch positions to Q? For example:
Q = "ABCDEFG";
L = ["ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT"];
answer = L.query(Q); # Returns "ABCCEFG"
answer2 = L.query("AAAATAA"); #Returns "AAAAAAA".
The obvious way is to scan every sequence in L, making the search take O(M * N). Is there any way to do this in sublinear time? I don't care if there's a large upfront cost to organizing L into some data structure because it will be queried a lot of times. Also, handling tied scores arbitrarily is fine.
Edit: To clarify, I am looking for the Hamming distance.
All the answers except the one that mentions best-first search are very much off.
Locality-sensitive hashing is basically dreaming. This is the first time I've seen answers this far off on Stack Overflow.
First, this is a hard but standard problem that was solved many years ago in different ways.
One approach uses a trie such as the one presented by Sedgewick here:
http://www.cs.princeton.edu/~rs/strings/
Sedgewick also has sample C code.
I quote from the paper titled "Fast Algorithms for Sorting and Searching Strings" by Bentley and Sedgewick:
"‘‘Near neighbor’’ queries locate all words within a given Hamming distance
of a query word (for instance, code is distance 2 from soda). We give a new algorithm for near neighbor searching in strings, present a simple C implementation, and describe experiments on its efficiency."
A second approach is to use indexing. Split the strings into character n-grams and index them with an inverted index (google for the Lucene spell checker to see how it's done).
Use the index to pull potential candidates and then run Hamming distance or edit distance on the candidates. This is the approach guaranteed to work best (and it's relatively simple).
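A rough sketch of that indexing approach, using character trigrams (the names, the trigram size and the fallback-to-full-scan behaviour are my own choices, not part of any particular library):

#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
#include <cstddef>

// Index: map each character trigram to the list of sequence ids that contain it.
std::unordered_map<std::string, std::vector<std::size_t>>
build_trigram_index(const std::vector<std::string> &db) {
    std::unordered_map<std::string, std::vector<std::size_t>> index;
    for (std::size_t id = 0; id < db.size(); ++id)
        for (std::size_t i = 0; i + 3 <= db[id].size(); ++i)
            index[db[id].substr(i, 3)].push_back(id);
    return index;
}

// Hamming distance; assumes a and b have equal length, as in the question.
std::size_t hamming(const std::string &a, const std::string &b) {
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] != b[i]);
    return d;
}

// Query: gather candidates sharing at least one trigram with q, then rank by Hamming
// distance. Falls back to a full scan when no candidate shares a trigram.
std::size_t best_match(const std::string &q,
                       const std::vector<std::string> &db,
                       const std::unordered_map<std::string, std::vector<std::size_t>> &index) {
    if (db.empty()) return std::size_t(-1);

    std::unordered_set<std::size_t> candidates;
    for (std::size_t i = 0; i + 3 <= q.size(); ++i) {
        auto it = index.find(q.substr(i, 3));
        if (it != index.end()) candidates.insert(it->second.begin(), it->second.end());
    }
    if (candidates.empty())
        for (std::size_t id = 0; id < db.size(); ++id) candidates.insert(id);

    std::size_t best = *candidates.begin();
    for (std::size_t id : candidates)
        if (hamming(q, db[id]) < hamming(q, db[best])) best = id;
    return best;   // index of the closest sequence found
}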
A third appears in the area of speech recognition. There the query is a wav signal, and the database is a set of strings. There is a "table" that matches pieces of the signal to pieces of words. The goal is to find the best match of words to signal. This problem is known as word alignment.
In the problem posted, there is an implicit cost of matching query parts to database parts.
For example one may have different costs for deletion/insertion/substitution and even
different costs for mismatching say "ph" with "f".
The standard solution in speech recognition uses a dynamic programming approach, made efficient via heuristics that direct pruning. In this way, only the best candidates, say 50, are kept. Hence the name best-first search. In theory, you may not get the best match, but usually you get a good one.
Here is a reference to the latter approach:
http://amta2010.amtaweb.org/AMTA/papers/2-02-KoehnSenellart.pdf
Fast Approximate String Matching with Suffix Arrays and A* Parsing.
This approach applies not only to words but to sentences.
Locality sensitive hashing underlies what seems to be the asymptotically best method known, as I understand it from this review article in CACM. Said article is pretty hairy and I didn't read it all. See also nearest neighbor search.
To relate these references to your problem: they all deal with a set of points in a metric space, such as an n-dimensional vector space. In your problem, n is the length of each string, and the values on each coordinate are the characters that can appear at each position in a string.
The "best" method will vary significantly depending on your input set and query set. Having a fixed message length will let you treat this problem in a classification context.
An information theoretic decision tree algorithm (like C4.5, for example) will provide the best overall guarantee on performance. In order to get optimal performance out of this method, you must first cluster the string indices into features based on mutual information. Note that you will need to modify the classifier to return all leaf nodes at the last branch, then compute a partial edit distance for each of them. The edit distance only needs to be calculated for the feature set represented by the last split of the tree.
Using this technique, querying should be ~ O(k log n), k << m, where k is the expectation of the feature size, m is the length of the string, and n is the number of comparison sequences.
The initial setup on this is guaranteed to be less than O(m^2 + n*t^2), t < m, t * k ~ m, where t is the feature count for an item. This is very reasonable and should not require any serious hardware.
These very nice performance numbers are possible because of the fixed m constraint. Enjoy!
I think you are looking for the Levenshtein edit distance.
There are a few questions here on SO about this already, I suppose you can find some good answers.
You could treat each sequence as an N-dimensional coordinate, chunk the resulting space into blocks that know what sequences occur in them, then on a lookup first search the query sequence's block and all contiguous blocks, expanding outward as necessary. (Maintaining several scopes of chunking is probably more desirable than getting into searching really large groups of blocks.)
Are you looking for the Hamming distance between the strings (i.e. the number of different characters at equivalent locations)?
Or does the distance "between" characters (e.g. difference between ASCII values of English letters) matter to you as well?
Some variety of best-first search on the target sequences will do much better than O(M * N). The basic idea of this is that you'd compare the first character in your candidate sequence with the first character of the target sequences, then in your second iteration only do the next-character comparison with the sequences that have the least number of mismatches, and so on. In your first example, you'd wind up comparing against ABCCEFG and AAAAAAA the second time, ABCCEFG only the third and fourth times, all the sequences the fifth time, and only ABCCEFG thereafter. When you get to the end of your candidate sequence, the set of target sequences with the lowest mismatch count is your match set.
(Note: at each step you're comparing against the next character for that branch of the search. None of the progressive comparisons skip characters.)
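Here's one way to render that description as code, using a priority queue keyed on mismatches-so-far (my own sketch, not a tuned implementation). Because the mismatch count never decreases as a partial comparison is extended, the first fully-extended entry popped from the queue is a sequence with minimal Hamming distance:

#include <queue>
#include <string>
#include <tuple>
#include <vector>
#include <cstddef>

// Best-first search over (mismatches so far, characters compared, sequence id).
// Entries with the fewest mismatches are extended first; the first entry that
// reaches the end of the query is a sequence with minimal Hamming distance.
// Assumes every sequence in db has the same length as q.
std::size_t best_first_match(const std::string &q, const std::vector<std::string> &db) {
    using Node = std::tuple<std::size_t, std::size_t, std::size_t>;   // (mismatches, pos, id)
    std::priority_queue<Node, std::vector<Node>, std::greater<Node>> pq;
    for (std::size_t id = 0; id < db.size(); ++id) pq.push(Node{0, 0, id});

    while (!pq.empty()) {
        auto [miss, pos, id] = pq.top();
        pq.pop();
        if (pos == q.size()) return id;                   // fully compared, minimal mismatches
        pq.push(Node{miss + (q[pos] != db[id][pos]), pos + 1, id});
    }
    return std::size_t(-1);                               // db was empty
}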
I can't think of a general, exact algorithm which will be less than O(N * M), but if you have a small enough M and N you can make an algorithm which performs as (N + M) using bit-parallel operations.
For example, if N and M are both less than 16, you could use an N × M lookup table of 64-bit ints (16 × log2(16) = 64), and perform all operations in one pass through the string, where each group of 4 bits in the counter counts 0-15 for one of the strings being matched. Obviously you need M * log2(N+1) bits to store the counters, so you might need to update multiple values for each character, but often a single-pass lookup can be faster than other approaches. So it's actually O(N * M log(N)), just with a lower constant factor - using 64-bit ints introduces a 1/64 into it, so it should be better if log2(N) < 64. If M * log2(N+1) < 64, it works out as (N+M) operations. But that's still linear, rather than sub-linear.
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>

size_t match ( const char* string, uint64_t table[][128] );

int main ()
{
    const char* data[] = { "ABCCEFG", "AAAAAAA", "TTAGGGT", "ZYXWVUT" };
    const size_t N = 7;
    const size_t M = 4;

    // prepare a table: for each position and character, set the 4-bit group
    // belonging to every candidate string with that character at that position
    uint64_t table[7][128] = { 0 };
    for ( size_t i = 0; i < M; ++i )
        for ( size_t j = 0; j < N; ++j )
            table[j][ (size_t)data[i][j] ] |= UINT64_C(1) << (i * 4);

    const char* examples[] = { "ABCDEFG", "AAAATAA", "TTAGQQT", "ZAAGVUT" };
    for ( size_t i = 0; i < 4; ++i ) {
        const char* q = examples[i];
        size_t result = match ( q, table );
        printf("Q(%s) -> %zu %s\n", q, result, data[result]);
    }
    return 0;
}

size_t match ( const char* string, uint64_t table[][128] )
{
    uint64_t count = 0;

    // scan through string once, updating all counters at once
    for ( size_t i = 0; string[i]; ++i )
        count += table[i][ (size_t) string[i] ];

    // find greatest sub-count within count
    size_t best = 0;
    size_t best_sub_count = count & 0xf;
    for ( size_t i = 1; i < 4; ++i ) {
        size_t sub_count = ( count >>= 4 ) & 0xf;
        if ( sub_count > best_sub_count ) {
            best_sub_count = sub_count;
            best = i;
        }
    }
    return best;
}
Sorry for bumping this old thread
To search elementwise would mean a complexity of O(M*N*N) - O(M) for searching and O(N*N) for calculating Levenshtein distance.
The OP is looking for an efficient way to find the smallest Hamming distance (c), not the string itself. If you have an upper bound on c (say X), you can find the smallest c in O(log(X)*M*N).
As Stefan pointed out, you can quickly find strings within a given Hamming distance. This page http://blog.faroo.com/2015/03/24/fast-approximate-string-matching-with-large-edit-distances/ talks about one such way using tries. Modify this to just test whether there is such a string, and binary search on c from 0 to X.
If up-front cost doesn't matter, you could calculate the best match for every possible input and put the results in a hash map.
Of course, this won't work unless N is extremely small.
This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?
Preliminary
I've got a huge number of small fixed-length strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: DNA) that I want to sort.
At the moment, I use std::sort which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work much better in practice.
Details
I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (N > 5,000,000).
The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.
Use Cases
Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.
Research
Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.
In particular, there are the following things.
First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions dest_group and dest_address without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least dest_address isn’t trivial).
Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).
Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear which is flat out wrong.
To recap: Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?
Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires 2 * seq.length passes through the array.
import std.algorithm : swap;   // needed for swap()

void radixSort(string[] seqs, size_t base = 0) {
    if (seqs.length == 0)
        return;

    // First pass: move 'A's to the front and 'T's to the back.
    size_t TPos = seqs.length, APos = 0;
    size_t i = 0;
    while (i < TPos) {
        if (seqs[i][base] == 'A') {
            swap(seqs[i], seqs[APos++]);
            i++;
        }
        else if (seqs[i][base] == 'T') {
            swap(seqs[i], seqs[--TPos]);
        } else i++;
    }

    // Second pass: within the middle region, move 'C's before 'G's.
    i = APos;
    size_t CPos = APos;
    while (i < TPos) {
        if (seqs[i][base] == 'C') {
            swap(seqs[i], seqs[CPos++]);
        }
        i++;
    }

    // Recurse on each of the four buckets using the next character.
    if (base < seqs[0].length - 1) {
        radixSort(seqs[0..APos], base + 1);
        radixSort(seqs[APos..CPos], base + 1);
        radixSort(seqs[CPos..TPos], base + 1);
        radixSort(seqs[TPos..seqs.length], base + 1);
    }
}
Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.
Edit:
I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above now is actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.
I've never seen an in-place radix sort, and from the nature of radix sort I doubt it is much faster than an out-of-place sort as long as the temporary array fits into memory.
Reason:
The sorting does a linear read on the input array, but all writes will be nearly random. From a certain N upwards this boils down to a cache miss per write. This cache miss is what slows down your algorithm. Whether it's in place or not will not change this effect.
I know that this will not answer your question directly, but if sorting is a bottleneck you may want to have a look at near sorting algorithms as a preprocessing step (the wiki-page on the soft-heap may get you started).
That could give a very nice cache locality boost. A text-book out-of-place radix sort will then perform better. The writes will still be nearly random but at least they will cluster around the same chunks of memory and as such increase the cache hit ratio.
I have no idea if it works out in practice though.
Btw: If you're dealing with DNA strings only, you can compress each char into two bits and pack your data quite a lot. This will cut the memory requirement down by a factor of four compared with a naive representation. Addressing becomes more complex, but the ALU of your CPU has lots of time to spend during all the cache misses anyway.
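A small illustration of that 2-bit packing (assuming plain A/C/G/T input; wildcard bases like N would need a wider code):

#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Pack A/C/G/T into 2 bits each, 32 bases per uint64_t.
std::vector<std::uint64_t> pack_dna(const std::string &seq) {
    std::vector<std::uint64_t> packed((seq.size() + 31) / 32, 0);
    for (std::size_t i = 0; i < seq.size(); ++i) {
        std::uint64_t code;
        switch (seq[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default: throw std::invalid_argument("not A/C/G/T");
        }
        packed[i / 32] |= code << (2 * (i % 32));
    }
    return packed;
}

// Extract base i again (for checking, or for comparisons during sorting).
char unpack_base(const std::vector<std::uint64_t> &packed, std::size_t i) {
    static const char map[4] = {'A', 'C', 'G', 'T'};
    return map[(packed[i / 32] >> (2 * (i % 32))) & 3];
}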
You can certainly drop the memory requirements by encoding the sequence in bits.
You are looking at 4^k possible sequences, so for length 2 with "ACGT" that's 16 states, or 4 bits.
For length 3, that's 64 states, which can be encoded in 6 bits. So it looks like 2 bits for each letter in the sequence, or about 32 bits for 16 characters, like you said.
If there is a way to reduce the number of valid 'words', further compression may be possible.
So for sequences of length 3, one could create 64 buckets, maybe sized uint32, or uint64.
Initialize them to zero.
Iterate through your very very large list of 3 char sequences, and encode them as above.
Use this as a subscript, and increment that bucket.
Repeat this until all of your sequences have been processed.
Next, regenerate your list.
Iterate through the 64 buckets in order, and for the count found in each bucket, generate that many instances of the sequence represented by that bucket.
When all of the buckets have been iterated, you have your sorted array.
A sequence of 4, adds 2 bits, so there would be 256 buckets.
A sequence of 5, adds 2 bits, so there would be 1024 buckets.
At some point the number of buckets will approach your limits.
If you read the sequences from a file, instead of keeping them in memory, more memory would be available for buckets.
I think this would be faster than doing the sort in situ as the buckets are likely to fit within your working set.
Here is a hack that shows the technique
#include <iostream>
#include <iomanip>
#include <cstring>   // memset, strlen
#include <cstdlib>   // abort
#include <math.h>

using namespace std;

const int width = 3;
const int bucketCount = exp(width * log(4)) + 1;
int *bucket = NULL;
const char charMap[4] = {'A', 'C', 'G', 'T'};

void setup(void)
{
    bucket = new int[bucketCount];
    memset(bucket, '\0', bucketCount * sizeof(bucket[0]));
}

void teardown(void)
{
    delete[] bucket;
}

// Decode a bucket index back into its width-character sequence and print it.
void show(int encoded)
{
    int z;
    int y;
    int j;
    for (z = width - 1; z >= 0; z--)
    {
        int n = 1;
        for (y = 0; y < z; y++)
            n *= 4;
        j = encoded % n;
        encoded -= j;
        encoded /= n;
        cout << charMap[encoded];
        encoded = j;
    }
    cout << endl;
}

int main(void)
{
    // Sort this sequence
    const char *testSequence = "CAGCCCAAAGGGTTTAGACTTGGTGCGCAGCAGTTAAGATTGTTT";
    size_t testSequenceLength = strlen(testSequence);

    setup();

    // load the sequences into the buckets
    size_t z;
    for (z = 0; z < testSequenceLength; z += width)
    {
        int encoding = 0;
        size_t y;
        for (y = 0; y < width; y++)
        {
            encoding *= 4;
            switch (*(testSequence + z + y))
            {
                case 'A' : encoding += 0; break;
                case 'C' : encoding += 1; break;
                case 'G' : encoding += 2; break;
                case 'T' : encoding += 3; break;
                default : abort();
            };
        }
        bucket[encoding]++;
    }

    /* show the sorted sequences */
    for (z = 0; z < bucketCount; z++)
    {
        while (bucket[z] > 0)
        {
            show(z);
            bucket[z]--;
        }
    }

    teardown();
    return 0;
}
If your data set is so big, then I would think that a disk-based buffer approach would be best:
sort(List<string> elements, int prefix)
    if (elements.Count < THRESHOLD)
        return InMemoryRadixSort(elements, prefix)
    else
        return DiskBackedRadixSort(elements, prefix)

DiskBackedRadixSort(elements, prefix)
    DiskBackedBuffer<string>[] buckets
    foreach (element in elements)
        buckets[element.MSB(prefix)].Add(element);

    List<string> ret
    foreach (bucket in buckets)
        ret.Add(sort(bucket, prefix + 1))

    return ret
I would also experiment with grouping into a larger number of buckets, for instance, if your string was:
GATTACA
the first MSB call would return the bucket for GATT (256 total buckets), that way you make fewer branches of the disk based buffer. This may or may not improve performance, so experiment with it.
I'm going to go out on a limb and suggest you switch to a heap/heapsort implementation. This suggestion comes with some assumptions:
You control the reading of the data
You can do something meaningful with the sorted data as soon as you 'start' getting it sorted.
The beauty of the heap/heap-sort is that you can build the heap while you read the data, and you can start getting results the moment you have built the heap.
Let's step back. If you are so fortunate that you can read the data asynchronously (that is, you can post some kind of read request and be notified when some data is ready), then you can build a chunk of the heap while you are waiting for the next chunk of data to come in - even from disk. Often, this approach can bury most of the cost of half of your sorting behind the time spent getting the data.
Once you have the data read, the first element is already available. Depending on where you are sending the data, this can be great. If you are sending it to another asynchronous reader, or some parallel 'event' model, or UI, you can send chunks and chunks as you go.
That said - if you have no control over how the data is read, and it is read synchronously, and you have no use for the sorted data until it is entirely written out - ignore all this. :(
See the Wikipedia articles:
Heapsort
Binary heap
"Radix sorting with no extra space" is a paper addressing your problem.
Performance-wise you might want to look at more general string-comparison sorting algorithms.
Currently you wind up touching every element of every string, but you can do better!
In particular, a burst sort is a very good fit for this case. As a bonus, since burstsort is based on tries, it works ridiculously well for the small alphabet sizes used in DNA/RNA, since you don't need to build any sort of ternary search node, hash or other trie node compression scheme into the trie implementation. The tries may be useful for your suffix-array-like final goal as well.
A decent general purpose implementation of burstsort is available on source forge at http://sourceforge.net/projects/burstsort/ - but it is not in-place.
For comparison purposes, the C-burstsort implementation covered at http://www.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf benchmarks 4-5x faster than quicksort and radix sorts for some typical workloads.
You'll want to take a look at Large-scale Genome Sequence Processing by Drs. Kasahara and Morishita.
Strings composed of the four nucleotide letters A, C, G, and T can be specially encoded into integers for much faster processing. Radix sort is among many algorithms discussed in the book; you should be able to adapt the accepted answer to this question and see a big performance improvement.
You might try using a trie. Sorting the data is simply iterating through the dataset and inserting it; the structure is naturally sorted, and you can think of it as similar to a B-Tree (except instead of making comparisons, you always use pointer indirections).
Caching behavior will favor all of the internal nodes, so you probably won't improve upon that; but you can fiddle with the branching factor of your trie as well (ensure that every node fits into a single cache line, allocate trie nodes similar to a heap, as a contiguous array that represents a level-order traversal). Since tries are also digital structures (O(k) insert/find/delete for elements of length k), you should have competitive performance to a radix sort.
I would burstsort a packed-bit representation of the strings. Burstsort is claimed to have much better locality than radix sorts, keeping the extra space usage down with burst tries in place of classical tries. The original paper has measurements.
It looks like you've solved the problem, but for the record, it appears that one version of a workable in-place radix sort is the "American Flag Sort". It's described here: Engineering Radix Sort. The general idea is to do 2 passes on each character - first count how many of each you have, so you can subdivide the input array into bins. Then go through again, swapping each element into the correct bin. Now recursively sort each bin on the next character position.
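Here's a compact sketch of that counting-then-swapping pass for the 4-letter DNA alphabet (my own rendering of the idea, not the code from the paper; it assumes fixed-length strings):

#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// American flag sort on character position 'pos': count bucket sizes, compute bucket
// boundaries, permute elements into their buckets in place, then recurse per bucket.
void american_flag_sort(std::vector<std::string> &a, std::size_t lo, std::size_t hi,
                        std::size_t pos) {
    if (hi - lo < 2 || pos >= a[lo].size()) return;       // fixed-length strings assumed
    auto bucket_of = [&](const std::string &s) -> std::size_t {
        switch (s[pos]) { case 'A': return 0; case 'C': return 1;
                          case 'G': return 2; default:  return 3; }   // 'T'
    };

    std::array<std::size_t, 4> count{};                   // pass 1: histogram
    for (std::size_t i = lo; i < hi; ++i) ++count[bucket_of(a[i])];

    std::array<std::size_t, 5> start{};                   // bucket boundaries
    start[0] = lo;
    for (std::size_t b = 0; b < 4; ++b) start[b + 1] = start[b] + count[b];

    std::array<std::size_t, 4> next = {start[0], start[1], start[2], start[3]};
    for (std::size_t b = 0; b < 4; ++b)                    // pass 2: swap elements home
        while (next[b] < start[b + 1]) {
            std::size_t d = bucket_of(a[next[b]]);
            if (d == b) ++next[b];                         // already in the right bucket
            else std::swap(a[next[b]], a[next[d]++]);      // send it home, take what was there
        }

    for (std::size_t b = 0; b < 4; ++b)                    // recurse on each bucket
        american_flag_sort(a, start[b], start[b + 1], pos + 1);
}

Call it as american_flag_sort(seqs, 0, seqs.size(), 0).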
Radix-Sort is not cache conscious and is not the fastest sort algorithm for large sets.
You can look at:
ti7qsort. ti7qsort is the fastest sort for integers (it can be used for small fixed-size strings).
Inline QSORT
String sorting
You can also use compression and encode each letter of your DNA into 2 bits before storing into the sort array.
dsimcha's MSB radix sort looks nice, but Nils gets closer to the heart of the problem with the observation that cache locality is what's killing you at large problem sizes.
I suggest a very simple approach:
Empirically estimate the largest size m for which a radix sort is efficient.
Read blocks of m elements at a time, radix sort them, and write them out (to a memory buffer if you have enough memory, but otherwise to file), until you exhaust your input.
Mergesort the resulting sorted blocks.
Mergesort is the most cache-friendly sorting algorithm I'm aware of: "Read the next item from either array A or B, then write an item to the output buffer." It runs efficiently on tape drives. It does require 2n space to sort n items, but my bet is that the much-improved cache locality you'll see will make that unimportant -- and if you were using a non-in-place radix sort, you needed that extra space anyway.
Please note finally that mergesort can be implemented without recursion, and in fact doing it this way makes clear the true linear memory access pattern.
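To illustrate the final merge step, here's a rough k-way generalization of that merge, using a heap of cursors (my own sketch; a real external-memory version would stream blocks from files rather than hold them in vectors):

#include <queue>
#include <string>
#include <utility>
#include <vector>
#include <cstddef>

// Merge k individually sorted blocks into one sorted output.
// The heap holds (current element, block index); pop the smallest, advance that block.
std::vector<std::string> merge_sorted_blocks(const std::vector<std::vector<std::string>> &blocks) {
    using Cursor = std::pair<std::string, std::size_t>;   // (value, which block)
    std::priority_queue<Cursor, std::vector<Cursor>, std::greater<Cursor>> heap;
    std::vector<std::size_t> pos(blocks.size(), 0);

    for (std::size_t b = 0; b < blocks.size(); ++b)
        if (!blocks[b].empty()) heap.push({blocks[b][0], b});

    std::vector<std::string> out;
    while (!heap.empty()) {
        auto [val, b] = heap.top();
        heap.pop();
        out.push_back(val);
        if (++pos[b] < blocks[b].size()) heap.push({blocks[b][pos[b]], b});
    }
    return out;
}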
First, think about the coding of your problem. Get rid of the strings, replace them by a binary representation. Use the first byte to indicate length+encoding. Alternatively, use a fixed length representation at a four-byte boundary. Then the radix sort becomes much easier. For a radix sort, the most important thing is to not have exception handling at the hot spot of the inner loop.
OK, I thought a bit more about the 4-nary problem. You want a solution like a Judy tree for this. The next solution can handle variable length strings; for fixed length just remove the length bits, that actually makes it easier.
Allocate blocks of 16 pointers. The least significant bit of the pointers can be reused, as your blocks will always be aligned. You might want a special storage allocator for it (breaking up large storage into smaller blocks). There are a number of different kinds of blocks:
Encoding with 7 length bits of variable-length strings. As they fill up, you replace them by:
Position encodes the next two characters, you have 16 pointers to the next blocks, ending with:
Bitmap encoding of the last three characters of a string.
For each kind of block, you need to store different information in the LSBs. As you have variable length strings you need to store end-of-string too, and the last kind of block can only be used for the longest strings. The 7 length bits should be replaced by less as you get deeper into the structure.
This provides you with a reasonably fast and very memory efficient storage of sorted strings. It will behave somewhat like a trie. To get this working, make sure to build enough unit tests. You want coverage of all block transitions. You want to start with only the second kind of block.
For even more performance, you might want to add different block types and a larger size of block. If the blocks are always the same size and large enough, you can use even fewer bits for the pointers. With a block size of 16 pointers, you already have a byte free in a 32-bit address space. Take a look at the Judy tree documentation for interesting block types. Basically, you add code and engineering time for a space (and runtime) trade-off.
You probably want to start with a 256 wide direct radix for the first four characters. That provides a decent space/time tradeoff. In this implementation, you get much less memory overhead than with a simple trie; it is approximately three times smaller (I haven't measured). O(n) is no problem if the constant is low enough, as you noticed when comparing with the O(n log n) quicksort.
Are you interested in handling doubles? With short sequences, there are going to be. Adapting the blocks to handle counts is tricky, but it can be very space-efficient.
While the accepted answer perfectly answers the description of the problem, I've reached this place looking in vain for an algorithm to partition an array in place into N parts. I've written one myself, so here it is.
Warning: this is not a stable partitioning algorithm, so for multilevel partitioning, one must repartition each resulting partition instead of the whole array. The advantage is that it is in place.
The way it helps with the question posed is that you can repeatedly partition in place based on a letter of the string, then sort the partitions when they are small enough with the algorithm of your choice.
function partitionInPlace(input, partitionFunction, numPartitions, startIndex=0, endIndex=-1) {
    if (endIndex === -1) endIndex = input.length;

    // count how many elements fall into each partition
    const starts = Array.from({ length: numPartitions + 1 }, () => 0);
    for (let i = startIndex; i < endIndex; i++) {
        const val = input[i];
        const partByte = partitionFunction(val);
        starts[partByte]++;
    }

    // convert the counts into starting offsets for each partition
    let prev = startIndex;
    for (let i = 0; i < numPartitions; i++) {
        const p = prev;
        prev += starts[i];
        starts[i] = p;
    }
    const indexes = [...starts];
    starts[numPartitions] = prev;

    let bucket = 0;
    while (bucket < numPartitions) {
        const start = starts[bucket];
        const end = starts[bucket + 1];
        if (end - start < 1) {
            bucket++;
            continue;
        }
        let index = indexes[bucket];
        if (index === end) {
            bucket++;
            continue;
        }
        let val = input[index];
        let destBucket = partitionFunction(val);
        if (destBucket === bucket) {
            indexes[bucket] = index + 1;
            continue;
        }
        // follow the displacement cycle until it closes back at 'index'
        let dest;
        do {
            dest = indexes[destBucket] - 1;
            let destVal;
            let destValBucket = destBucket;
            while (destValBucket === destBucket) {
                dest++;
                destVal = input[dest];
                destValBucket = partitionFunction(destVal);
            }
            input[dest] = val;
            indexes[destBucket] = dest + 1;
            val = destVal;
            destBucket = destValBucket;
        } while (dest !== index);
    }
    return starts;
}