Search one character in a million characters string - algorithm

What's the best approach to search for one character in a million-character string? I'm asking from an algorithmic point of view rather than about how to do it in a particular programming language.
Is binary search a good approach?

Without preprocessing, scan the string until you meet the target character. If you only need to check presence or the location of the first instance, you are done. Otherwise, you need to scan to the end.
With preprocessing:
* If you need to report presence or count, build a histogram (a count of the instances of every possible value); this can be done in a single pass (with possible early termination if the count is not required). Then a query is done in constant time.
* If you need to report the first instance (or some instance), fill a table of first-occurrence indexes for each character value; this can be done in a single pass (with possible early termination). Then a query is done in constant time.
* If you need to report all instances, you can prefill linked lists of all instances of every character; this can be done in a single pass, but the storage cost is heavy (one link per character). Then a query is done in time proportional to the number of occurrences.
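The three preprocessing options above can be combined into one pass. The following is only a sketch: `buildCharIndex` is a hypothetical name, plain arrays stand in for the linked lists, and characters are taken to be UTF-16 code units.

```javascript
// One pass over the string records, for every character seen:
// its count, the index of its first occurrence, and all of its positions.
function buildCharIndex(text) {
  const index = new Map();
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    let entry = index.get(ch);
    if (!entry) {
      entry = { count: 0, first: i, positions: [] };
      index.set(ch, entry);
    }
    entry.count++;
    entry.positions.push(i);
  }
  return index;
}

const charIndex = buildCharIndex("abracadabra");
console.log(charIndex.get("a").count);                     // 5
console.log(charIndex.get("c").first);                     // 4
console.log(charIndex.get("b").positions.join(","));       // 1,8
console.log(charIndex.has("z"));                           // false
```

After the pass, presence, count and first-occurrence queries are single map lookups, and listing all occurrences costs time proportional to their number.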
Note that sorting with a general sort and then answering the queries by binary search is probably the worst thing you can do. General sorting costs more than needed (N Log(N) instead of N), and the queries are more expensive (Log(N) instead of constant time). On top of that, if you need the location information, you'll have to augment each character with its original index before sorting.
If the characters in the string are known to be in sorted order (a pretty unlikely situation!), the answer is different:
* If you need to query just once, use a dichotomic (binary) search (two searches if you are asked for the count or the range where the character is found).
* If you need to perform more queries (at least S Log(S), where S is the size of the alphabet), then you can delimit the ranges of equal characters by a series of dichotomic searches.
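For the sorted case, the two dichotomic searches that delimit a character's range might look like the sketch below. `firstIndexWhere` and `rangeOf` are illustrative names, and the string is assumed to be sorted in code-unit order.

```javascript
// Smallest index i in [0, sorted.length] such that pred(sorted[i]) is true,
// assuming pred is false on a prefix of the string and true on the rest.
function firstIndexWhere(sorted, pred) {
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (pred(sorted[mid])) hi = mid;
    else lo = mid + 1;
  }
  return lo;
}

// Half-open range [start, end) occupied by target; empty if it is absent.
function rangeOf(sorted, target) {
  const start = firstIndexWhere(sorted, (c) => c >= target);
  const end = firstIndexWhere(sorted, (c) => c > target);
  return { start, end, count: end - start };
}

console.log(rangeOf("aaabbcddd", "b").start); // 3
console.log(rangeOf("aaabbcddd", "b").end);   // 5
console.log(rangeOf("aaabbcddd", "x").count); // 0
```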

Let L be the string length and S the alphabet size.
Without preprocessing, you need a sequential search. It will take a number of comparisons equal to the position of the first occurrence of the target character (or L if absent). Best case 1, worst case L, average case about L/K when the K occurrences of the target character are placed uniformly at random (the exact expectation is (L+1)/(K+1)); for a balanced distribution, where K is about L/S, this is about S.
With preprocessing, you fill a presence table by a sequential scan of the string. The number of character comparisons will equal the "last" first occurrence of any character (or L if one is absent). Best case S, worst case L. Extra storage of S bits is required. Subsequent queries are done in constant time.

Related

How to find the correct "word part" records that make up an input "word string", given a word part dataset?

In agglutinative languages, "words" is a fuzzy concept. Examples of agglutinative languages are Turkish, Inuktitut, and many Native American languages (amongst others). In them, "words" are usually composed of a "base" and multiple prefixes/suffixes. So you might have ama-ebi-na-mo-kay-i-mang-na (I just made that up), where ebi is the base, and the rest are affixes. Let's say this means "walking early in the morning when the birds start singing": ama/early ebi/walk na/-ing mo/during kay/bird i/plural mang/sing na/-ing. These words can get quite long, like 30+ "letters".
So I was playing around with creating a "dictionary" for a language like this, but it's not realistic to write definitions or "dictionary entries" as your typical English "words", because there are a possibly infinite number of words! (All combinations of prefixes/bases/suffixes). So instead, I was trying to think maybe you could have just these "word parts" in the database (prefixes/suffixes/bases, which can't stand by themselves actually in the real spoken language, but are clearly distinct in terms of adding meaning). By having a database of word parts, you would then (in theory) query by passing as input a long say 20-character "word", and it would figure out how to break this word down into word parts because of the database (somehow).
That is, it would take amaebinamokayimangna as input, and know that it can be broken down into ama-ebi-na-mo-kay-i-mang-na, and then it simply queries the database for those parts to return whatever metadata is associated with those parts.
What would you need to do to accomplish this, at a high level? Assuming you had a database (SQL or just a text file) containing these affixes and bases, how could you take the input and know that it breaks down into these parts organized in this way? Maybe it turns out there are other parts in the DB which can be arranged like a-ma-e-bina-mo-kay-im-ang-na, which is spelled the exact same way (if you remove the hyphens), so it would likely find that as a result too, and return it as another possible match.
The only way (naive way) I can think of solving this currently, is to break the input string into ngrams like this:
function getNgrams(str, { min = 1, max = 8 } = {}) {
  const ngrams = []
  const points = Array.from(str)
  const n = points.length
  let minSize = min
  while (minSize <= max) {
    for (let i = 0; i < (n - minSize + 1); i++) {
      const ngram = points.slice(i, i + minSize)
      ngrams.push(ngram.join(''))
    }
    minSize++
  }
  return ngrams
}
And it would then check the database if any of those ngrams exist, maybe passing in if this is a prefix (start of word), infix, or suffix (end of word) part. The database parts table would have { id, text, is_start, is_end } sort of thing. But this would be horribly inefficient and probably wouldn't work. It seems really complex how you might go about solving this.
So wondering, how would you solve this? At a high level, what is the main vision you see of how you would tackle this, either in a SQL database or some other approach?
The goal is, save to some persisted area the word parts, and how they are combined (if they are a prefix/infix/suffix), and then take as input a string which could be generated from those parts, and try and figure out what the parts are from the persisted data, and then return those parts in the correct order.
First consider the simplified problem where we have a combination of prefixes only. To be able to split this into prefixes, we would do:
1. Store all the prefixes in a trie.
2. Let's say the input has n characters. Create an array of length n (of numbers, if you need just one possible split, or sets of numbers, if you need all possible splits). We will store in this array for each index, from which positions of the input string this index can be reached by adding a prefix from the dictionary.
3. For each substring starting with the 1st character of the input, if it belongs to the trie, mark the index as can be reached from 0th position (i.e. there is a path from 0th position to k-th position). The trie allows us to do this in O(n).
4. For all i = 2..n, if the i-th character can be reached from the beginning, repeat the previous step for the substrings starting at i, mark their end position as "can be reached from (i-1)th position" as appropriate (i.e. there is a path from (i-1)th position to ((i-1)+k)th position).
5. At the end, we can traverse these indices backwards, starting at the end of the array. Each time we jump to an index stored in the array, we are skipping a prefix in the dictionary. Each path from the last position to the first position gives us a possible split. Since we repeated the 4-th step only for positions that can be reached from the 0-th position, all paths are guaranteed to end up at the 0-th position.
Building the array takes O(n^2) time (assuming we have the trie built already). Traversing the array to find all possible splits is O(n*s), where s is the number of possible splits. In any case, we can say if there is a possible split as soon as we have built the array.
The problem with prefixes, suffixes and base words is a slight modification of the above:
Build the "previous" indices for prefixes and "next" for suffixes (possibly starting from the end of the input and tracking the suffixes backwards).
For each base word in the string (all of which we can also find efficiently -O(n^2)- using a trie) see if the starting position can be reached from the left using prefixes, and end position can be reached from right using suffixes. If yes, you have a split.
As you can see, the keywords are trie and dynamic programming. The problem of finding only a single split requires O(n^2) time after the tries are built. Tries can be built in O(m) time where m is the total length of added strings.
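A minimal sketch of the simplified (prefix-only) version of this approach follows. The function names are hypothetical, the reachability array has n + 1 entries so that index j means "the first j characters are covered", and parts are compared by UTF-16 code unit.

```javascript
// Build a trie from the word parts; `end` marks the end of a complete part.
function buildTrie(parts) {
  const root = { children: new Map(), end: false };
  for (const part of parts) {
    let node = root;
    for (let k = 0; k < part.length; k++) {
      if (!node.children.has(part[k])) {
        node.children.set(part[k], { children: new Map(), end: false });
      }
      node = node.children.get(part[k]);
    }
    node.end = true;
  }
  return root;
}

// prev[j] lists every i such that input.slice(i, j) is a part
// and position i is itself reachable from position 0.
function buildPrev(input, trie) {
  const n = input.length;
  const prev = Array.from({ length: n + 1 }, () => []);
  for (let i = 0; i < n; i++) {
    if (i > 0 && prev[i].length === 0) continue; // position i is unreachable
    let node = trie;
    for (let j = i; j < n; j++) {
      node = node.children.get(input[j]);
      if (!node) break;
      if (node.end) prev[j + 1].push(i);
    }
  }
  return prev;
}

// Walk prev backwards from the end of the input to enumerate every split.
function allSplits(input, parts) {
  const prev = buildPrev(input, buildTrie(parts));
  const results = [];
  const walk = (end, suffix) => {
    if (end === 0) {
      results.push(suffix);
      return;
    }
    for (const start of prev[end]) {
      walk(start, [input.slice(start, end), ...suffix]);
    }
  };
  walk(input.length, []);
  return results;
}

const wordParts = ["ama", "ebi", "na", "mo", "kay", "i", "mang"];
console.log(allSplits("amaebinamokayimangna", wordParts).map((s) => s.join("-")));
// one split: ama-ebi-na-mo-kay-i-mang-na
console.log(JSON.stringify(allSplits("abc", ["a", "ab", "b", "c"])));
// [["ab","c"],["a","b","c"]]
```

Handling prefixes, bases and suffixes separately would need one such table per part type, as described above; this sketch treats every part the same way.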

Look for a data structure to match words by letters

Given a list of random lowercase words, all of the same length, and many patterns in which the letters at some positions are specified while the other letters are unknown, find all the words that match each pattern.
For example, words list is:
["ixlwnb","ivknmt","vvqnbl","qvhntl"]
And patterns are:
i-----
-v---l
-v-n-l
With a naive algorithm, one can do an O(NL) scan for each pattern, where N is the word count and L is the word length.
But since there may be a lot of patterns scanning the same word list, is there a good data structure to preprocess and store the word list, so that every pattern can be matched efficiently?
One simple idea is to use an inverted index. First, number your words -- you'll refer to them using these indices rather than the words themselves for speed and space efficiency. Probably the index fits in a 32-bit int.
Now your inverted index: for each letter in each position, construct a sorted list of IDs for words that have that letter in that location.
To do your search, you take the lists of IDs for each of the letters in the positions you're given, and take the intersection of the lists, using an algorithm like the "merge" in merge-sort. All IDs in the intersection match the input.
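Here is one way the inverted index and the merge-style intersection could be sketched. The names are illustrative, `-` marks an unknown letter as in the question, and all words and patterns are assumed to have the same length.

```javascript
// index[pos] maps a letter to the ascending list of IDs of words
// that have this letter at position pos.
function buildPositionIndex(words) {
  const len = words.length ? words[0].length : 0;
  const index = Array.from({ length: len }, () => new Map());
  words.forEach((word, id) => {
    for (let pos = 0; pos < len; pos++) {
      const ids = index[pos].get(word[pos]) ?? [];
      ids.push(id); // IDs arrive in increasing order, so each list stays sorted
      index[pos].set(word[pos], ids);
    }
  });
  return index;
}

// Intersection of two ascending arrays, like the merge step of merge sort.
function intersectSorted(a, b) {
  const out = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { out.push(a[i]); i++; j++; }
    else if (a[i] < b[j]) i++;
    else j++;
  }
  return out;
}

function matchPattern(words, index, pattern) {
  let ids = null; // null means no constraint has been applied yet
  for (let pos = 0; pos < pattern.length; pos++) {
    if (pattern[pos] === "-") continue;
    const list = index[pos].get(pattern[pos]) ?? [];
    ids = ids === null ? list : intersectSorted(ids, list);
  }
  if (ids === null) return words.slice(); // an all-blank pattern matches everything
  return ids.map((id) => words[id]);
}

const wordList = ["ixlwnb", "ivknmt", "vvqnbl", "qvhntl"];
const positionIndex = buildPositionIndex(wordList);
console.log(matchPattern(wordList, positionIndex, "i-----").join(" ")); // ixlwnb ivknmt
console.log(matchPattern(wordList, positionIndex, "-v---l").join(" ")); // vvqnbl qvhntl
console.log(matchPattern(wordList, positionIndex, "-v-n-l").join(" ")); // vvqnbl qvhntl
```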
Alternatively, if your words are short enough (12 characters or fewer), you could compress them into 64-bit words (using 5 bits per letter, with letters numbered 1-26). Construct a bit-mask with binary 11111 in places where you have a letter, and 00000 in places where you have a blank. Also construct a bit-test value from your input, with the 5-bit code for each letter in its place and 00000 where you have blanks. For example, if your input is a-c then your bit-mask will be binary 111110000011111 and your bit-test binary 000010000000011. Go through your word list, take the bitwise AND of each word with the bit-mask, and test whether it equals the bit-test value. This is cache friendly and the inner loop is tight, so it may be competitive with algorithms that look like they should be faster on paper.
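The bit-packing idea can be illustrated as follows. JavaScript numbers are not 64-bit integers, so this sketch uses `BigInt` purely to show the encoding; it will not have the tight inner loop a native 64-bit implementation would. The function names are made up for the example.

```javascript
// Pack a lowercase word into a BigInt, 5 bits per letter (a = 1 ... z = 26),
// first letter in the highest bits. A '-' blank packs as 00000.
function pack(word) {
  let bits = 0n;
  for (let i = 0; i < word.length; i++) {
    const code = word[i] === "-" ? 0 : word.charCodeAt(i) - 96;
    bits = (bits << 5n) | BigInt(code);
  }
  return bits;
}

// 11111 where the pattern has a letter, 00000 where it has a blank.
function makeMask(pattern) {
  let mask = 0n;
  for (let i = 0; i < pattern.length; i++) {
    mask = (mask << 5n) | (pattern[i] === "-" ? 0n : 31n);
  }
  return mask;
}

// Linear scan: a word matches when (word AND mask) equals the bit-test value.
function matchPacked(packedWords, words, pattern) {
  const mask = makeMask(pattern);
  const test = pack(pattern);
  return words.filter((_, i) => (packedWords[i] & mask) === test);
}

console.log(makeMask("a-c").toString(2));               // 111110000011111
console.log(pack("a-c").toString(2).padStart(15, "0")); // 000010000000011

const shortWords = ["abc", "axc", "abd"];
const packedShortWords = shortWords.map(pack);
console.log(matchPacked(packedShortWords, shortWords, "a-c").join(" ")); // abc axc
```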
I'll preface this by saying it's more of a comment and less of an answer (I don't have enough reputation to comment though). I can't think of any data structure that will satisfy the requirements out of the box. It was interesting to think about, and I figured I'd share one potential solution that popped into my head.
I keyed in on the "same length" part, and figured I could come up with something based on that.
In theory we could have N maps (N being the word length) of char -> set.
When a string is added, we go through each of its characters and add the string to the corresponding set. Pseudocode:
firstCharMap[s[0]].insert(s);
secondCharMap[s[1]].insert(s);
thirdCharMap[s[2]].insert(s);
fourthCharMap[s[3]].insert(s);
fifthCharMap[s[4]].insert(s);
sixthCharMap[s[5]].insert(s);
Then to determine which strings match the pattern, we just take an intersection of the sets. For example, "-v-n-l" would be:
intersection of sets: secondCharMap[v], fourthCharMap[n], sixthCharMap[l]
One edge case that jumps out is if I wanted to just get all of the strings, so if that's a requirement--we may also need an additional set of all of the strings.
This solution feels clunky, but I think it could work. Depending on the language, number of strings, etc--I wouldn't be surprised if it performed worse than just iterating over all strings and checking a predicate.

Minimum number of deletions for a given word to become a dictionary word

Given a dictionary as a hashtable. Find the minimum # of
deletions needed for a given word in order to make it match any word in the
dictionary.
Is there some clever trick to solve this problem in less than exponential complexity (trying all possible combinations)?
For starters, suppose that you have a single word w in the hash table and that your word is x. You can delete letters from x to form w if and only if w is a subsequence of x, and in that case the number of letters you need to delete from x to form w is given by |x| - |w|. So certainly one option would be to just iterate over the hash table and, for each word, to see if that word is a subsequence of x, taking the best match you find across the table.
To analyze the runtime of this operation, let's suppose that there are n total words in your hash table and that their total length is L. Then the runtime of this operation is O(L), since you'll process each character across all the words at most once. The complexity of your initial approach is O(|x| · 2^|x|) because there are 2^|x| possible words you can make by deleting letters from x and you'll spend O(|x|) time processing each one. Depending on the size of your dictionary and the size of your word, one algorithm might be better than the other, but we can say that the runtime is O(min{L, |x| · 2^|x|}) if you take the better of the two approaches.
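The dictionary scan could be sketched like this, with a plain two-pointer subsequence test. The names and the `-1` "no match" convention are assumptions made for the example, and the dictionary is represented as a `Set` of strings.

```javascript
// True if w can be obtained from x by deleting letters,
// i.e. w is a subsequence of x.
function isSubsequence(w, x) {
  let i = 0;
  for (let j = 0; j < x.length && i < w.length; j++) {
    if (x[j] === w[i]) i++;
  }
  return i === w.length;
}

// Minimum number of deletions turning x into some dictionary word,
// or -1 if no dictionary word can be reached by deletions alone.
function minDeletions(x, dictionary) {
  let best = -1;
  for (const w of dictionary) {
    if (w.length > x.length) continue; // cannot be a subsequence of x
    if (isSubsequence(w, x)) {
      const deletions = x.length - w.length;
      if (best === -1 || deletions < best) best = deletions;
    }
  }
  return best;
}

const dictionary = new Set(["cat", "chat", "at", "dog"]);
console.log(minDeletions("chariot", dictionary)); // 3 (delete r, i, o to get "chat")
console.log(minDeletions("dog", dictionary));     // 0
console.log(minDeletions("xyz", dictionary));     // -1
```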
You can build a trie and then see where your given word fits into it. The difference in the depth of your word and the closest existing parent is the number of deletions required.

Complexity of binary search on a string

I have a sorted array of strings, e.g. ["bar", "foo", "top", "zebra"], and I want to check whether an input word is present in the array or not.
e.g.:
search (String[] str, String word) {
    // binary search implemented + string comparison.
}
Now binary search will account for a complexity of O(logn), where n is the length of the array. So far so good.
But, at some point we need to do a string compare, which can be done in linear time.
Now the input array can contain words of different sizes. So when I am calculating the final complexity, will the final answer be O(m*logn), where m is the size of the word we want to search for in the array (in our case "zebra")?
Yes, both your thinking and your proposed solution are correct. You need to consider the length of the longest String too in the overall complexity of String searching.
A trivial String compare is an O(m) operation, where m is the length of the larger of the two strings.
But, we can improve a lot, given that the array is sorted. As user "doynax" suggests,
Complexity can be improved by keeping track of how many characters got matched during
the string comparisons, and store the present count for the lower and
upper bounds during the search. Since the array is sorted we know that
the prefix of the middle entry to be tested next must match up to at
least the minimum of the two depths, and therefore we can skip
comparing that prefix. In effect we're always either making progress
or stopping the incremental comparisons immediately on a mismatch, and
thereby never needing to keep going over old ground.
So, overall, m character comparisons have to be done to reach the end of the string if the word is found, or fewer than that if the search fails at an early stage.
So, the overall complexity would be O(m + log n).
I was under the impression that what original poster said was correct by saying time complexity is O(m*logn).
If you use the suggested enhancement to improve the time complexity (to get O(m + logn)) by tracking previously matched letters I believe the below inputs would break it.
arr = ["abc", "def", "ghi", "nlj", "pfypfy", "xyz"]
target = "nljpfy"
I expect this would incorrectly match on “pfypfy”. Perhaps one of the original posters can weigh in on this. Definitely curious to better understand what was proposed. It sounds like matched number of letters are skipped in next comparison.
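To make the proposal concrete, here is a sketch of one reading of it: remember how many characters of the target matched the entries just outside the current lower and upper bounds, and start each comparison at the smaller of the two counts. The function name is hypothetical and the array is assumed to be sorted in code-unit order. Only that shared prefix is skipped, never part of the target, so on the input above it reports "not found" rather than matching "pfypfy". The sketch illustrates the skipping only and makes no claim about the resulting worst-case bound.

```javascript
// Binary search over a sorted array of strings that skips the prefix
// already known to match. Returns the index of target, or -1 if absent.
function searchSorted(arr, target) {
  let lo = 0;
  let hi = arr.length - 1;
  let loMatched = 0; // chars of target matching the entry just below lo
  let hiMatched = 0; // chars of target matching the entry just above hi
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const word = arr[mid];
    // Every entry between the bounds shares at least this many chars with target.
    let k = Math.min(loMatched, hiMatched);
    while (k < word.length && k < target.length && word[k] === target[k]) k++;
    if (k === word.length && k === target.length) return mid;
    const wordIsSmaller =
      k === word.length || (k < target.length && word[k] < target[k]);
    if (wordIsSmaller) {
      lo = mid + 1;
      loMatched = k;
    } else {
      hi = mid - 1;
      hiMatched = k;
    }
  }
  return -1;
}

const sortedWords = ["abc", "def", "ghi", "nlj", "pfypfy", "xyz"];
console.log(searchSorted(sortedWords, "nljpfy")); // -1
console.log(searchSorted(sortedWords, "nlj"));    // 3
console.log(searchSorted(["bar", "foo", "top", "zebra"], "zebra")); // 3
```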

Efficient data structure for word lookup with wildcards

I need to match a series of user-entered words against a large dictionary of words (to ensure the entered value exists).
So if the user entered:
"orange" it should match an entry "orange" in the dictionary.
Now the catch is that the user can also enter a wildcard or series of wildcard characters like say
"or__ge" which would also match "orange"
The key requirements are:
* this should be as fast as possible.
* use the smallest amount of memory to achieve it.
If the size of the word list was small I could use a string containing all the words and use regular expressions.
However, given that the word list could potentially contain hundreds of thousands of entries, I'm assuming this wouldn't work.
So would some sort of 'tree' be the way to go for this...?
Any thoughts or suggestions on this would be totally appreciated!
Thanks in advance,
Matt
Put your word list in a DAWG (directed acyclic word graph) as described in Appel and Jacobsen's paper on the World's Fastest Scrabble Program (free copy at Columbia). For your search you will traverse this graph maintaining a set of pointers: on a letter, you make a deterministic transition to children with that letter; on a wildcard, you add all children to the set.
The efficiency will be roughly the same as Thompson's NFA interpretation for grep (they are the same algorithm). The DAWG structure is extremely space-efficient—far more so than just storing the words themselves. And it is easy to implement.
Worst-case cost will be the size of the alphabet (26?) raised to the power of the number of wildcards. But unless your query begins with N wildcards, a simple left-to-right search will work well in practice. I'd suggest forbidding a query to begin with too many wildcards, or else create multiple dawgs, e.g., dawg for mirror image, dawg for rotated left three characters, and so on.
Matching an arbitrary sequence of wildcards, e.g., ______ is always going to be expensive because there are combinatorially many solutions. The dawg will enumerate all solutions very quickly.
I would first test the regex solution and see whether it is fast enough - you might be surprised! :-)
However if that wasn't good enough I would probably use a prefix tree for this.
The basic structure is a tree where:
The nodes at the top level are all the possible first letters (i.e. probably 26 nodes from a-z assuming you are using a full dictionary...).
The next level down contains all the possible second letters for each given first letter
And so on until you reach an "end of word" marker for each word
Testing whether a given string with wildcards is contained in your dictionary is then just a simple recursive algorithm where you either have a direct match for each character position, or in the case of the wildcard you check each of the possible branches.
In the worst case (all wildcards but only one word with the right number of letters right at the end of the dictionary), you would traverse the entire tree but this is still only O(n) in the size of the dictionary so no worse than a full regex scan. In most cases it would take very few operations to either find a match or confirm that no such match exists since large branches of the search tree are "pruned" with each successive letter.
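The recursive prefix-tree lookup described here could be sketched as below. The names are illustrative and `_` is the single-letter wildcard from the question.

```javascript
// Prefix tree: each node has a map of children and an end-of-word marker.
function buildWordTrie(words) {
  const root = { children: new Map(), isWord: false };
  for (const word of words) {
    let node = root;
    for (let i = 0; i < word.length; i++) {
      if (!node.children.has(word[i])) {
        node.children.set(word[i], { children: new Map(), isWord: false });
      }
      node = node.children.get(word[i]);
    }
    node.isWord = true;
  }
  return root;
}

// Collect every stored word matching the pattern.
// A literal follows one branch; a wildcard tries every branch.
function matchWildcard(node, pattern, pos = 0, prefix = "", out = []) {
  if (pos === pattern.length) {
    if (node.isWord) out.push(prefix);
    return out;
  }
  const ch = pattern[pos];
  if (ch === "_") {
    for (const [letter, child] of node.children) {
      matchWildcard(child, pattern, pos + 1, prefix + letter, out);
    }
  } else if (node.children.has(ch)) {
    matchWildcard(node.children.get(ch), pattern, pos + 1, prefix + ch, out);
  }
  return out;
}

const wordTrie = buildWordTrie(["orange", "oblige", "orchid", "apple"]);
console.log(matchWildcard(wordTrie, "or__ge").join(" "));        // orange
console.log(matchWildcard(wordTrie, "o____e").sort().join(" ")); // oblige orange
console.log(matchWildcard(wordTrie, "orang_").join(" "));        // orange
console.log(matchWildcard(wordTrie, "banana").length);           // 0
```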
No matter which algorithm you choose, you have a tradeoff between speed and memory consumption.
If you can afford ~ O(N*L) memory (where N is the size of your dictionary and L is the average length of a word), you can try this very fast algorithm. For simplicity, I will assume the Latin alphabet with 26 letters and MAX_LEN as the maximum word length.
Create a 2D array of sets of integers, set<int> table[26][MAX_LEN].
For each word in your dictionary, add the word index to the sets in the positions corresponding to each of the letters of the word. For example, if "orange" is the 12345-th word in the dictionary, you add 12345 to the sets corresponding to [o][0], [r][1], [a][2], [n][3], [g][4], [e][5].
Then, to retrieve words corresponding to "or..ge", you find the intersection of the sets at [o][0], [r][1], [g][4], [e][5].
You can try a string-matrix:
0,1: A
1,5: APPLE
2,5: AXELS
3,5: EAGLE
4,5: HELLO
5,5: WORLD
6,6: ORANGE
7,8: LONGWORD
8,13:SUPERLONGWORD
Let's call this a ragged index-matrix, to spare some memory. Order it on length, and then on alphabetical order. To address a character I use the notation x,y:z: x is the index, y is the length of the entry, z is the position. The length of your string is f and g is the number of entries in the dictionary.
* Create list m, which contains potential match indexes x.
* Iterate on z from 0 to f.
  * Is it a wildcard and not the last character of the search string?
    * Continue loop (all match).
  * Is m empty?
    * Search through all x from 0 to g for y that matches length. !!A!!
    * Does the z character match the search string at that z? Save x in m.
    * Is m empty? Break loop (no match).
  * Is m not empty?
    * Search through all elements of m. !!B!!
    * Does not match with search? Remove from m.
    * Is m empty? Break loop (no match).
A wildcard will always pass the "Match with search string?". And m is equally ordered as the matrix.
!!A!!: Binary search on length of the search string. O(log n)
!!B!!: Binary search on alphabetical ordering. O(log n)
The reason for using a string-matrix is that you already store the length of each string (because it makes searching faster), but it also gives you the length of each entry (assuming other constant fields), such that you can easily find the next entry in the matrix, for fast iterating. Ordering the matrix isn't a problem, since this only has to be done when the dictionary is updated, and not at search time.
If you are allowed to ignore case, which I assume, then make all the words in your dictionary and all the search terms the same case before anything else. Upper or lower case makes no difference. If you have some words that are case sensitive and others that are not, break the words into two groups and search each separately.
You are only matching words, so you can break the dictionary into an array of strings. Since you are only doing an exact match against a known length, break the word array into a separate array for each word length. So byLength[3] is the array of all words with length 3. Each word array should be sorted.
Now you have an array of words and a word with potential wildcards to find. Depending on whether and where the wildcards are, there are a few approaches.
If the search term has no wild cards, then do a binary search in your sorted array. You could do a hash at this point, which would be faster but not much. If the vast majority of your search terms have no wildcards, then consider a hash table or an associative array keyed by hash.
If the search term has wildcards after some literal characters, then do a binary search in the sorted array to find an upper and lower bound, then do a linear search in that bound. If the wildcards are all trailing then finding a non empty range is sufficient.
If the search term starts with wild cards, then the sorted array is no help and you would need to do a linear search unless you keep a copy of the array sorted by backwards strings. If you make such an array, then choose it any time there are more trailing than leading literals. If you do not allow leading wildcards then there is no need.
If the search term both starts and ends with wildcards, then you are stuck with a linear search within the words with equal length.
So an array of arrays of strings. Each array of strings is sorted, and contains strings of equal length. Optionally duplicate the whole structure with the sorting based on backwards strings for the case of leading wildcards.
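A sketch of this layout for the common case (literal characters first, wildcards later) is shown below. The names are hypothetical, `_` is the wildcard, and the optional reversed copy for leading wildcards is left out, so a pattern that starts with a wildcard falls back to a linear scan of its length group.

```javascript
// Group words by length; each group is sorted in code-unit order.
function buildByLength(words) {
  const byLength = new Map();
  for (const w of words) {
    if (!byLength.has(w.length)) byLength.set(w.length, []);
    byLength.get(w.length).push(w);
  }
  for (const group of byLength.values()) group.sort();
  return byLength;
}

// Index of the first entry that is >= key.
function lowerBound(sorted, key) {
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] < key) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Binary search narrows the group to the literal prefix,
// then a linear scan checks the remaining positions.
function findMatches(byLength, pattern) {
  const group = byLength.get(pattern.length) ?? [];
  const firstWild = pattern.indexOf("_");
  const prefix = firstWild === -1 ? pattern : pattern.slice(0, firstWild);
  const out = [];
  for (let i = lowerBound(group, prefix); i < group.length; i++) {
    const word = group[i];
    if (!word.startsWith(prefix)) break; // left the range sharing the prefix
    let ok = true;
    for (let k = prefix.length; k < pattern.length && ok; k++) {
      ok = pattern[k] === "_" || pattern[k] === word[k];
    }
    if (ok) out.push(word);
  }
  return out;
}

const groups = buildByLength(["orange", "oblige", "orchid", "apple", "ornate"]);
console.log(findMatches(groups, "or__ge").join(" ")); // orange
console.log(findMatches(groups, "or____").join(" ")); // orange orchid ornate
console.log(findMatches(groups, "__ange").join(" ")); // orange
console.log(findMatches(groups, "apple").join(" "));  // apple
```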
The overall space is one or two pointers per word, plus the words. You should be able to store all the words in a single buffer if your language permits. Of course, if your language does not permit, grep is probably faster anyway. For a million words, that is 4-16MB for the arrays and similar for the actual words.
For a search term with no wildcards, performance would be very good. With wildcards, there will occasionally be linear searches across large groups of words. With the breakdown by length and a single leading character, you should never need to search more than a few percent of the total dictionary even in the worst case. Comparing only whole words of known length will always be faster than generic string matching.
Try building a Generalized Suffix Tree if the dictionary will be matched against a sequence of queries. There is a linear-time algorithm for building such a tree (Ukkonen's suffix tree construction).
You can easily match (it's O(k), where k is the size of the query) each query by traversing from the root node, and use the wildcard character to match any character like typical pattern finding in suffix tree.

Resources