Word-separating algorithm - algorithm

What is the algorithm - seemingly in use on domain parking pages - that takes a spaceless bunch of words (eg "thecarrotofcuriosity") and more-or-less correctly breaks it down into the constituent words (eg "the carrot of curiosity") ?

Start with a basic trie data structure representing your dictionary. As you iterate through the characters of the string, search your way through the trie with a set of pointers rather than a single pointer; the set is seeded with the root of the trie. For each letter, the whole set is advanced at once via the pointer indicated by the letter, and if a set element cannot be advanced by the letter, it is removed from the set. Whenever you reach a possible end-of-word, add a new root-of-trie to the set (keeping track of the list of words seen associated with that set element). Finally, once all characters have been processed, return an arbitrary list of words whose set element is at the root-of-trie. If there's more than one, that means the string could be broken up in multiple ways (such as "therapistforum" which can be parsed as ["therapist", "forum"] or ["the", "rapist", "forum"]) and it's undefined which we'll return.
Or, in a wacked up pseudocode (Java foreach, tuple indicated with parens, set indicated with braces, cons using head :: tail, [] is the empty list):
List<String> breakUp(String str, Trie root) {
Set<(List<String>, Trie)> set = {([], root)};
for (char c : str) {
Set<(List<String>, Trie)> newSet = {};
for (List<String> ls, Trie t : set) {
Trie tNext = t.follow(c);
if (tNext != null) {
newSet.add((ls, tNext));
if (tNext.isWord()) {
newSet.add((t.follow(c).getWord() :: ls, root));
}
}
}
set = newSet;
}
for (List<String> ls, Trie t : set) {
if (t == root) return ls;
}
return null;
}
Let me know if I need to clarify or I missed something...
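For anyone who wants to experiment, here is a rough Python sketch of the set-of-pointers idea above. It is only a sketch under my own assumptions: the trie is a nested dict with a "$" key as the end-of-word marker, the names build_trie and break_up are made up for illustration, and it keeps at most one "restart at root" state per position (enough when any one segmentation will do).

```python
def build_trie(words):
    """Build a nested-dict trie; the key '$' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def break_up(text, root):
    """Return one segmentation of text as a list of words, or None."""
    # Each state is (words_so_far, start_of_current_word, trie_node).
    states = [((), 0, root)]
    for i, ch in enumerate(text):
        next_states = []
        seen_restart = False
        for words, start, node in states:
            child = node.get(ch)
            if child is None:
                continue  # this state cannot consume ch, so it is dropped
            next_states.append((words, start, child))
            if "$" in child and not seen_restart:
                # A word ends here: also start a fresh walk from the root.
                next_states.append((words + (text[start:i + 1],), i + 1, root))
                seen_restart = True
        states = next_states
    for words, _, node in states:
        if node is root:
            return list(words)
    return None


trie = build_trie(["the", "car", "rot", "carrot", "of", "curiosity"])
print(break_up("thecarrotofcuriosity", trie))
```

Because only one restart state is kept per position, the number of live states never exceeds the number of characters read so far.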

I would imagine they take a dictionary word list like /usr/share/dict/words on your common or garden variety Unix system and try to find sets of word matches (starting from the left?) that result in the largest amount of original text being covered by a match. A simple breadth-first-search implementation would probably work fine, since it obviously doesn't have to run fast.

I'd imagine these sites do it similar to this:
Get a list of words for your target language
Remove "useless" words like "a", "the", ...
Run through the list and check which of the words are substrings of the domain name
Take the most common words of the remaining list (Or the ones with the highest adsense rating,...)
Of course that leads to nonsense for expertsexchange, but what else would you expect there...

(disclaimer: I did not try it myself, so take it merely as food for experimentation. 4-grams are taken mostly out of the blue sky, just from my experience that 3-grams won't work all too well; 5-grams and more might work better, even though you will have to deal with a pretty large table). It's also simplistic in the sense that it does not take into account the ending of the string; if it works for you otherwise, you'd probably need to think about fixing the endings.
This algorithm would run in a predictable time proportional to the length of the string that you are trying to split.
So, first: Take a lot of human-readable texts. for each of the text, supposing it is in a single string str, run the following algorithm (pseudocode-ish notation, assumes the [] is a hashtable-like indexing, and that nonexistent indexes return '0'):
for(i=0;i<length(str)-5;i++) {
// take 4-character substring starting at position i
subs2 = substring(str, i, 4);
if(has_space(subs2)) {
subs = substring(str, i, 5);
delete_space(subs);
yes_space[subs][position(space, subs2)]++;
} else {
subs = subs2;
no_space[subs]++;
}
}
This will build you the tables which will help to decide whether a given 4-gram would need to have a space in it inserted or not.
Then, take your string to split, I denote it as xstr, and do:
for(i=0;i<length(xstr)-5;i++) {
subs = substring(xstr, i, 4);
for(j=0;j<4;j++) {
do_insert_space_here[i+j] -= no_space[subs];
}
for(j=0;j<4;j++) {
do_insert_space_here[i+j] += yes_space[subs][j];
}
}
Then you can walk the "do_insert_space_here[]" array - if an element at a given position is bigger than 0, then you should insert a space in that position in the original string. If it's less than zero, then you shouldn't.
Please drop a note here if you try it (or something of this sort) and it works (or does not work) for you :-)

Related

Kotlin map not working with List of String

I have been working on code where I have to generate all possible ways to the target string. I am using the below-mentioned code.
Print Statement:
println("---------- How Construct -------")
println("${
window.howConstruct("purple", listOf(
"purp",
"p",
"ur",
"le",
"purpl"
))
}")
Function Call:
fun howConstruct(
target: String,
wordBank: List<String>,
): List<List<String>> {
if (target.isEmpty()) return emptyList()
var result = emptyList<List<String>>()
for (word in wordBank) {
if (target.indexOf(word) == 0) { // Starting with prefix
val substring = target.substring(word.length)
val suffixWays = howConstruct(substring, wordBank)
val targetWays = suffixWays.map { way ->
val a = way.toMutableList().apply {
add(word)
}
a.toList()
}
result = targetWays
}
}
return result
}
Expected Output:-
[['purp','le'],['p','ur','p','le']]
Current Output:-
[]
Your code is almost working; only a couple of small changes are needed to get the required output:
If the target is empty, return listOf(emptyList()) instead of emptyList().
Use add(0, word) instead of add(word).
The first of those changes is the important one. Your function returns a list of matches; and since each match is itself a list of strings, it returns a list of lists of strings. Once your code has matched the entire target and calls itself one last time, it returned an empty list — i.e. no matches — instead of a list containing an empty list — meaning one match with no remaining strings.
The second change simply fixes the order of strings within each match, which was reversed (because it appended the prefix after the returned suffix match).
However, there are many other ways that code could be improved. Rather than list them all individually, it's probably easier to give an alternative version:
fun howConstruct(target: String, wordBank: List<String>
): List<List<String>>
= if (target == "") listOf(emptyList())
else wordBank.filter{ target.endsWith(it) } // Look for suffixes of the target in the word bank
.flatMap { suffix: String ->
howConstruct(target.removeSuffix(suffix), wordBank) // For each, recurse to search the rest
.map{ it + suffix } } // And append the suffix to each match.
That does almost exactly the same as your code, except that it searches from the end of the string — matching suffixes — instead of from the beginning. The result is the same; the main benefit is that it's simpler to append a suffix string to a partial match list (using +) than to prepend a prefix (which is quite messy, as you found).
However, it's a lot more concise, mainly because it uses a functional style — in particular, it uses filter() to determine which words are valid suffixes, and flatMap() to collate the list of matches corresponding to each one recursively, as well as map() to append the suffix to each one (like your code does). That avoids all the business of looping over lists, creating lists, and adding to them. As a result, it doesn't need to deal with mutable lists or variables, avoiding some sources of confusion and error.
I've written it as an expression body (with = instead of { … }) for simplicity. I find that's simpler and clearer for short functions, though this one is about the limit. It might fit as an extension function on String, since it's effectively returning a transformation of the string, without any side-effects; again, that tends to work best on short functions.
There are also several small tweaks. It's a bit simpler — and more efficient — to use startsWith() or endsWith() instead of indexOf(); removePrefix() or removeSuffix() is arguably slightly clearer than substring(); and I find == "" clearer than isEmpty().
(Also, the name howConstruct() doesn't really describe the result very well, but I haven't come up with anything better so far…)
Many of these changes are of course a matter of personal preference, and I'm sure other developers would write it in many other ways! But I hope this has given some ideas.

Breaking a string apart into a sequence of words

I recently came across the following interview question:
Given an input string and a dictionary of words, implement a method that breaks up the input string into a space-separated string of dictionary words that a search engine might use for "Did you mean?" For example, an input of "applepie" should yield an output of "apple pie".
I can't seem to get an optimal solution as far as complexity is concerned. Does anyone have any suggestions on how to do this efficiently?
Looks like the question is exactly my interview problem, down to the example I used in the post at The Noisy Channel. Glad you liked the solution. Am quite sure you can't beat the O(n^2) dynamic programming / memoization solution I describe for worst-case performance.
You can do better in practice if your dictionary and input aren't pathological. For example, if you can identify in linear time which substrings of the input string are in the dictionary (e.g., with a trie) and if the number of such substrings is constant, then the overall time will be linear. Of course, that's a lot of assumptions, but real data is often much nicer than a pathological worst case.
There are also fun variations of the problem to make it harder, such as enumerating all valid segmentations, outputting a best segmentation based on some definition of best, handling a dictionary too large to fit in memory, and handling inexact segmentations (e.g., correcting spelling mistakes). Feel free to comment on my blog or otherwise contact me to follow up.
This link describes this problem as a perfect interview question and provides several methods to solve it. Essentially it involves recursive backtracking. At this level it would produce an O(2^n) complexity. An efficient solution using memoization could bring this problem down to O(n^2).
Using Python, we can write two functions. The first, segment, returns the first segmentation of a piece of contiguous text into words given a dictionary, or None if no such segmentation is found. The second, segment_all, returns a list of all segmentations found. With memoization, the worst-case complexity of segment is O(n**2), where n is the input string length in characters; segment_all can cost more than that when the number of valid segmentations is itself large.
The solution presented here can be extended to include spelling corrections and bigram analysis to determine the most likely segmentation.
def memo(func):
    '''
    Applies simple memoization to a function
    '''
    cache = {}

    def closure(*args):
        if args in cache:
            v = cache[args]
        else:
            v = func(*args)
            cache[args] = v
        return v
    return closure


def segment(text, words):
    '''
    Return the first match that is the segmentation of 'text' into words
    '''
    @memo
    def _segment(text):
        if text in words:
            return text
        for i in range(1, len(text)):
            prefix, suffix = text[:i], text[i:]
            segmented_suffix = _segment(suffix)
            if prefix in words and segmented_suffix:
                return '%s %s' % (prefix, segmented_suffix)
        return None
    return _segment(text)


def segment_all(text, words):
    '''
    Return a full list of matches that are the segmentation of 'text' into words
    '''
    @memo
    def _segment(text):
        matches = []
        if text in words:
            matches.append(text)
        for i in range(1, len(text)):
            prefix, suffix = text[:i], text[i:]
            segmented_suffix_matches = _segment(suffix)
            if prefix in words and len(segmented_suffix_matches):
                for match in segmented_suffix_matches:
                    matches.append('%s %s' % (prefix, match))
        return matches
    return _segment(text)


if __name__ == "__main__":
    string = 'cargocultscience'
    words = set('car cargo go cult science'.split())
    print(segment(string, words))
    # >>> car go cult science
    print(segment_all(string, words))
    # >>> ['car go cult science', 'cargo cult science']
One option would be to store all valid English words in a trie. Once you've done this, you could start walking the trie from the root downward, following the letters in the string. Whenever you find a node that's marked as a word, you have two options:
Break the input at this point, or
Continue extending the word.
You can claim that you've found a match once you have broken the input up into a set of words that are all legal and have no remaining characters left. Since at each letter you either have one forced option (either you are building a word that isn't valid and should stop -or- you can keep extending the word) or two options (split or keep going), you could implement this function using exhaustive recursion:
PartitionWords(lettersLeft, wordSoFar, wordBreaks, trieNode):
    // If you walked off the trie, this path fails.
    if trieNode is null, return.
    // If this trie node is a word, consider what happens if you split
    // the word here.
    if trieNode.isWord:
        // If there is no input left, you're done and have a partition.
        if lettersLeft is empty, output wordBreaks + wordSoFar and return
        // Otherwise, try splitting here.
        PartitionWords(lettersLeft, "", wordBreaks + wordSoFar, trie root)
    // If there is no input left but we are mid-word, this path fails.
    if lettersLeft is empty, return.
    // Otherwise, consume the next letter and continue:
    PartitionWords(lettersLeft.substring(1), wordSoFar + lettersLeft[0],
                   wordBreaks, trieNode.child[lettersLeft[0]])
In the pathologically worst case this will list all partitions of the string, which can take exponentially long. However, this only occurs if you can partition the string in a huge number of ways that all start with valid English words, and is unlikely to occur in practice. If the string has many partitions, we might spend a lot of time finding them, though. For example, consider the string "dotheredo." We can split this many ways:
do the redo
do the red o
doth ere do
dot here do
dot he red o
dot he redo
To avoid this, you might want to institute a limit of the number of answers you report, perhaps two or three.
Since we cut off the recursion when we walk off the trie, if we ever try a split that doesn't leave the remainder of the string valid, we will detect this pretty quickly.
Hope this helps!
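As a rough illustration of the exhaustive recursion with a cap on the number of answers, here is a Python sketch. Note the assumptions: it swaps the trie for a plain set of words (so there is no early cut-off from walking off the trie, only a bound on word length), and the name partitions and the limit parameter are mine, not part of the pseudocode above.

```python
def partitions(text, words, limit=3):
    """Return up to `limit` ways to split text into words from `words`."""
    results = []
    max_len = max(map(len, words), default=0)

    def walk(start, so_far):
        if len(results) >= limit:
            return  # enough answers reported, stop exploring
        if start == len(text):
            results.append(" ".join(so_far))
            return
        # Try every dictionary word that starts at position `start`.
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            if text[start:end] in words:
                walk(end, so_far + [text[start:end]])

    walk(0, [])
    return results


vocab = {"do", "the", "redo", "red", "o", "doth", "ere", "dot", "here", "he"}
print(partitions("dotheredo", vocab, limit=3))
```

With a large enough limit this finds the same six splits of "dotheredo" listed above.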
import java.util.*;
class Position {
int indexTest,no;
Position(int indexTest,int no)
{
this.indexTest=indexTest;
this.no=no;
} } class RandomWordCombo {
static boolean isCombo(String[] dict,String test)
{
HashMap<String,ArrayList<String>> dic=new HashMap<String,ArrayList<String>>();
Stack<Position> pos=new Stack<Position>();
for(String each:dict)
{
if(dic.containsKey(""+each.charAt(0)))
{
//System.out.println("=========it is here");
ArrayList<String> temp=dic.get(""+each.charAt(0));
temp.add(each);
dic.put(""+each.charAt(0),temp);
}
else
{
ArrayList<String> temp=new ArrayList<String>();
temp.add(each);
dic.put(""+each.charAt(0),temp);
}
}
Iterator it = dic.entrySet().iterator();
while (it.hasNext()) {
Map.Entry pair = (Map.Entry)it.next();
System.out.println("key: "+pair.getKey());
for(String str:(ArrayList<String>)pair.getValue())
{
System.out.print(str);
}
}
pos.push(new Position(0,0));
while(!pos.isEmpty())
{
Position position=pos.pop();
System.out.println("position index: "+position.indexTest+" no: "+position.no);
if(dic.containsKey(""+test.charAt(position.indexTest)))
{
ArrayList<String> strings=dic.get(""+test.charAt(position.indexTest));
if(strings.size()>1&&position.no<strings.size()-1)
pos.push(new Position(position.indexTest,position.no+1));
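String str=strings.get(position.no);
// only accept the word if it really occurs at this position,
// not merely because its first letter matches
if(!test.startsWith(str,position.indexTest))
continue;
if(position.indexTest+str.length()==test.length())
return true;
pos.push(new Position(position.indexTest+str.length(),0));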
}
}
return false;
}
public static void main(String[] st)
{
String[] dic={"world","hello","super","hell"};
System.out.println("is 'hellworld' a combo: "+isCombo(dic,"hellworld"));
} }
I have done a similar problem. This solution returns true or false depending on whether the given string is a combination of dictionary words. It can easily be converted to return a space-separated string. Its average complexity is O(n), where n is the number of dictionary words in the given string.

Efficient data structure/algorithm for transliteration based word lookup

I'm looking for an efficient data structure/algorithm for storing and searching transliteration-based word lookups (like Google does: http://www.google.com/transliterate/ but I'm not trying to use the Google transliteration API). Unfortunately, the natural language I'm trying to work on doesn't have any soundex implemented, so I'm on my own.
For an open source project I'm currently using plain arrays for storing the word list and dynamically generating a regular expression (based on user input) to match them. It works fine, but regular expressions are more powerful and resource-intensive than I need. For example, I'm afraid this solution will drain too much battery if I try to port it to handheld devices, as searching over thousands of words with a regular expression is too costly.
There must be a better way to accomplish this for complex languages, how does Pinyin input method work for example? Any suggestion on where to start?
Thanks in advance.
Edit: If I understand correctly, this is suggested by @Dialecticus:
I want to transliterate from Language1, which has 3 characters a,b,c to Language2, which has 6 characters p,q,r,x,y,z. Because the languages differ in their number of characters and in their phones, it is often not possible to define a one-to-one mapping.
Let's assume that phonetically this is our associative array/transliteration table:
a -> p, q
b -> r
c -> x, y, z
We also have a valid word lists in plain arrays for Language2:
...
px
qy
...
If the user types ac, the possible combinations become px, py, pz, qx, qy, qz after transliteration step 1. In step 2 we have to do another search in the valid word list and eliminate every one of them except px and qy.
What I'm doing currently is not that different from the above approach. Instead of making possible combinations using the transliteration table, I'm building a regular expression [pq][xyz] and matching that with my valid word list, which provides the output px and qy.
I'm eager to know if there is any better method than that.
From what I understand, you have an input string S in an alphabet (let's call it A1) and you want to convert it to the string S' which is its equivalent in another alphabet A2. Actually, if I understand correctly, you want to generate a list [S'1,S'2,...,S'n] of output strings which might potentially be equivalent to S.
One approach that comes to mind is, for each word in the list of valid words in A2, to generate the list of strings in A1 that map to it. Using the example in your edit, we have
px->ac
qy->ac
pr->ab
(I have added an extra valid word pr for clarity)
Now that we know what possible series of input symbols will always map to a valid word, we can use our table to build a Trie.
Each node will hold a pointer to a list of valid words in A2 that map to the sequence of symbols in A1 that form the path from the root of the Trie to the current node.
Thus for our example, the Trie would look something like this
        Root (empty)
            | a
            |
            V
   +---Node (empty)---+
   | b                | c
   |                  |
   V                  V
Node (pr)        Node (px,qy)
Starting at the root node, as symbols are consumed transitions are made from the current node to its child marked with the symbol consumed until we have read the entire string. If at any point no transition is defined for that symbol, the entered string does not exist in our trie and thus does not map to a valid word in our target language. Otherwise, at the end of the process, the list of words associated with the current node is the list of valid words the input string maps to.
Apart from the initial cost of building the trie (the trie can be shipped pre-built if we never want the list of valid words to change), this takes O(n) on the length of the input to find a list of mapping valid words.
Using a trie also provides the advantage that you can use it to find the list of all valid words that can be generated by adding more symbols to the end of the input, i.e. a prefix match. For example, if fed the input symbol 'a', we can use the trie to find all valid words whose input begins with 'a' ('px','qy','pr'). But doing that is not as fast as finding the exact match.
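To make the exact-match and prefix-match lookups concrete, here is a small Python sketch of the same kind of trie. It is illustrative only: it uses dict-based children rather than a fixed-size array, and the names Node, add_word, exact_matches and prefix_matches are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # source symbol -> Node
        self.words = []     # target words whose source spelling ends here


def add_word(root, source, target):
    node = root
    for symbol in source:
        node = node.children.setdefault(symbol, Node())
    node.words.append(target)


def find_node(root, text):
    """Follow text from the root; return the node reached, or None."""
    node = root
    for symbol in text:
        node = node.children.get(symbol)
        if node is None:
            return None
    return node


def exact_matches(root, text):
    node = find_node(root, text)
    return list(node.words) if node is not None else []


def prefix_matches(root, text):
    """All target words whose source spelling starts with text."""
    node = find_node(root, text)
    found = []
    stack = [node] if node is not None else []
    while stack:  # visit the whole subtree below the node reached
        current = stack.pop()
        found.extend(current.words)
        stack.extend(current.children.values())
    return found


root = Node()
for source, target in [("ac", "px"), ("ac", "qy"), ("ab", "pr")]:
    add_word(root, source, target)
print(exact_matches(root, "ac"), sorted(prefix_matches(root, "a")))
```

The exact match costs one step per input symbol, while the prefix match also has to visit the whole subtree, which is why it is slower.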
Here's a quick hack at a solution (in Java):
import java.util.*;
class TrieNode{
// child nodes - size of array depends on your alphabet size,
// here we are only using the lowercase English characters 'a'-'z'
TrieNode[] next=new TrieNode[26];
List<String> words;
public TrieNode(){
words=new ArrayList<String>();
}
}
class Trie{
private TrieNode root=null;
public void addWord(String sourceLanguage, String targetLanguage){
root=add(root,sourceLanguage.toCharArray(),0,targetLanguage);
}
private static int convertToIndex(char c){ // you need to change this for your alphabet
return (c-'a');
}
private TrieNode add(TrieNode cur, char[] s, int pos, String targ){
if (cur==null){
cur=new TrieNode();
}
if (s.length==pos){
cur.words.add(targ);
}
else{
cur.next[convertToIndex(s[pos])]=add(cur.next[convertToIndex(s[pos])],s,pos+1,targ);
}
return cur;
}
public List<String> findMatches(String text){
return find(root,text.toCharArray(),0);
}
private List<String> find(TrieNode cur, char[] s, int pos){
if (cur==null) return new ArrayList<String>();
else if (pos==s.length){
return cur.words;
}
else{
return find(cur.next[convertToIndex(s[pos])],s,pos+1);
}
}
}
class MyMiniTransliiterator{
public static void main(String args[]){
Trie t=new Trie();
t.addWord("ac","px");
t.addWord("ac","qy");
t.addWord("ab","pr");
System.out.println(t.findMatches("ac")); // prints [px, qy]
System.out.println(t.findMatches("ab")); // prints [pr]
System.out.println(t.findMatches("ba")); // prints empty list since this does not match anything
}
}
This is a very simple trie, no compression or speedups and only works on lower case English characters for the input language. But it can be easily modified for other character sets.
I would build the transliterated sentence one symbol at a time, instead of one word at a time. For most languages it is possible to transliterate every symbol independently of other symbols in the word. You can still have exceptions as whole words that have to be transliterated as complete words, but a transliteration table of symbols and exceptions will surely be smaller than a transliteration table of all existing words.
Best structure for transliteration table is some sort of associative array, probably utilizing hash tables. In C++ there's std::unordered_map, and in C# you would use Dictionary.

How to split a string into words. Ex: "stringintowords" -> "String Into Words"?

What is the right way to split a string into words ?
(string doesn't contain any spaces or punctuation marks)
For example: "stringintowords" -> "String Into Words"
Could you please advise what algorithm should be used here ?
Update: For those who think this question is just for curiosity: this algorithm could be used to camelcase domain names ("sportandfishing .com" -> "SportAndFishing .com"), and this algo is currently used by aboutus dot org to do this conversion dynamically.
Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w such a splitting is possible. This can be easily done with dynamic programming.
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for i=2 to length(w) calculate
S[i] = isWord(w[1..i]) or (there is some j in {2..i} with S[j-1] and isWord(w[j..i])).
This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solution by storing all such splits.
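A minimal Python sketch of this table, assuming 0-based indices and an is_word callable supplied by the caller (both the name split_words and the back-pointer array are my own choices for illustration). Instead of a Boolean, each entry stores the winning split point so that one splitting can be read back at the end.

```python
def split_words(w, is_word):
    """Return one split of w into words as a list, or None if impossible."""
    n = len(w)
    # back[i] is the start index of the last word in some valid split of
    # w[:i]; back[0] = 0 marks the empty prefix as splittable.
    back = [None] * (n + 1)
    back[0] = 0
    for i in range(1, n + 1):
        for j in range(i):
            if back[j] is not None and is_word(w[j:i]):
                back[i] = j
                break
    if back[n] is None:
        return None
    # Walk the stored split points backwards to recover the words.
    parts, i = [], n
    while i > 0:
        parts.append(w[back[i]:i])
        i = back[i]
    return parts[::-1]


dictionary = {"string", "into", "words", "in", "to"}
print(split_words("stringintowords", dictionary.__contains__))
```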
As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:
(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use it properly (that is, by incrementally testing for words).
(b) searching for "segmentation dynamic programming" yields a score of more detailed answers, from university-level lectures with pseudo-code algorithms, such as this lecture at Duke (which even goes so far as to provide a simple probabilistic approach to deal with words that won't be contained in any dictionary).
There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.
In general, you'll probably want to learn about Markov models and the Viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters; you only need to continue evaluating the most likely one.
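Here is a small Python sketch of that insight, under a simplifying assumption: a unigram model in which each word has an independent probability and a segmentation's score is the product of its word probabilities. The probabilities in the example are invented purely for illustration, and most_likely_split is a hypothetical name. For each prefix length only the best-scoring segmentation is kept.

```python
import math


def most_likely_split(text, prob, max_word_len=20):
    """Return the highest-probability split of text under a unigram model."""
    n = len(text)
    # best[i] = (log-probability, words) of the best split of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            if word in prob and best[j][1] is not None:
                score = best[j][0] + math.log(prob[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n][1]  # None if no segmentation exists


# Made-up probabilities, for illustration only.
prob = {"windows": 0.004, "window": 0.003, "steam": 0.001,
        "team": 0.005, "blog": 0.002}
print(most_likely_split("windowsteamblog", prob))
```

A real model would estimate the probabilities from a corpus and would usually condition on the previous word as well.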
If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.
For example: windowsteamblog (of http://windowsteamblog.com/ fame)
windows team blog
window steam blog
Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings.
You could look at this problem as choosing where you need to split the string. You also need to choose how many splits there will be. So there are Sum(i = 0 to n - 1, n - 1 choose i) possible splittings. By the Binomial Theorem, with x and y both being 1, this is equal to pow(2, n-1).
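The count is easy to check by brute force. This short Python sketch (all_splittings is an illustrative name) enumerates every choice of cut points among the n-1 gaps between characters:

```python
from itertools import combinations


def all_splittings(s):
    """Yield every way to cut s into non-empty contiguous pieces."""
    n = len(s)
    for k in range(n):  # k = number of cut points, 0 .. n-1
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [s[a:b] for a, b in zip(bounds, bounds[1:])]


for pieces in all_splittings("cat"):
    print(pieces)
```

For "cat" this yields the 4 splittings mentioned above, and in general 2^(n-1) of them for a string of length n.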
Granted, a lot of this computation rests on common subproblems, so Dynamic Programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such that M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i0, j0, i1, j1, ...) with the condition that j sub k = i sub (k + 1).
If your goal is to correctly camel-case URLs, I would sidestep the problem and go for something a little more direct: Get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array NumSpaces that records how much whitespace occurs in the original string before each character, like so:
Needle: isashort
Haystack: This is a short phrase
Preprocessed: thisisashortphrase
NumSpaces : 000011233333444444
And your answer would come from:
location = Preprocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + Needle.length() - 1] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)
Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.
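A rough Python sketch of this lookup, with a few assumptions of mine: instead of NumSpaces it stores, for each preprocessed character, its index in the original text (which equals the preprocessed index plus NumSpaces at that index), it treats the page as plain text rather than HTML, and it assumes lowercasing never changes the length of a character. The function name is made up.

```python
def find_original_phrase(needle, haystack):
    """Locate a spaceless, lowercase needle in haystack; return the original text."""
    if not needle:
        return None
    preprocessed = []
    original_index = []  # original_index[k] = position in haystack of preprocessed[k]
    for pos, ch in enumerate(haystack):
        if not ch.isspace():
            preprocessed.append(ch.lower())
            original_index.append(pos)
    location = "".join(preprocessed).find(needle)
    if location == -1:
        return None
    start = original_index[location]
    end = original_index[location + len(needle) - 1]
    return haystack[start:end + 1]


print(find_original_phrase("isashort", "This is a short phrase"))
print(find_original_phrase("madduckets", "Welcome to Mad Duckets!"))
```

The returned slice keeps the page's own spacing and capitalization, which is the whole point of the trick.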
This can be actually done (to a certain degree) without dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor) and apply the learned model for new domain names. I'm not sure how well it would work, though (but it would be interesting).
This is basically a variation of a knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in Wiki.
With a fair-sized dictionary this is going to be an insanely resource-intensive and lengthy operation, and you cannot even be sure that the problem will be solved.
Create a list of possible words, sort it from long words to short words.
Check each entry in the list against the first part of the string. If it matches, remove it from the string and append it to your sentence with a space. Repeat this.
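A quick Python sketch of this longest-first greedy idea (greedy_split is an illustrative name, and the word lists are made up). It also shows the catch: because the greedy choice is never revisited, it can fail on strings that do have a valid split.

```python
def greedy_split(text, words):
    """Repeatedly strip the longest dictionary word that prefixes text."""
    by_length = sorted(words, key=len, reverse=True)
    sentence = []
    while text:
        for word in by_length:
            if word and text.startswith(word):
                sentence.append(word)
                text = text[len(word):]
                break
        else:
            return None  # no word fits here; greedy does not backtrack
    return " ".join(sentence)


print(greedy_split("stringintowords", {"string", "into", "words", "in", "to"}))
# "barearring" could be split as "bar ear ring", but greedy takes "bare"
# first and then gets stuck on "arring".
print(greedy_split("barearring", {"bar", "bare", "ear", "ring"}))
```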
A simple Java solution which has O(n^2) running time.
public class Solution {
// should contain the list of all words, or you can use any other data structure (e.g. a Trie)
private HashSet<String> dictionary;
public String parse(String s) {
return parse(s, new HashMap<String, String>());
}
public String parse(String s, HashMap<String, String> map) {
if (map.containsKey(s)) {
return map.get(s);
}
if (dictionary.contains(s)) {
return s;
}
for (int left = 1; left < s.length(); left++) {
String leftSub = s.substring(0, left);
if (!dictionary.contains(leftSub)) {
continue;
}
String rightSub = s.substring(left);
String rightParsed = parse(rightSub, map);
if (rightParsed != null) {
String parsed = leftSub + " " + rightParsed;
map.put(s, parsed);
return parsed;
}
}
map.put(s, null);
return null;
}
}
I was looking at the problem and thought maybe I could share how I did it.
It's a little too hard to explain my algorithm in words so maybe I could share my optimized solution in pseudocode:
string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);
/** this way, one does not check the dictionary to check for word validity
* on every substring; It would only be queried once and for all,
* eliminating multiple travels to the data storage
*/
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();
validwords = validwords.sort(length, desc);
array segments = [];
while(mainword != ""){
    matched = false;
    for(x = 0; x < validwords.length; x++){
        if(mainword.startswith(validwords[x])) {
            segments.push(validwords[x]);
            // strip the matched word from the front, then restart with the longest word
            mainword = mainword.substring(validwords[x].length);
            matched = true;
            break;
        }
    }
    /**
    * remove the first character if none of the valid words match, then start again
    * you may need to add the first character to the result if you want to
    */
    if(!matched){
        mainword = mainword.substring(1);
    }
}
string result = segments.join(" ");

An efficient technique to replace an occurrence in a sequence with mutable or immutable state

I am searching for an efficient technique to find a sequence of Op occurrences in a Seq[Op]. Once an occurrence is found, I want to replace the occurrence with a defined replacement and run the same search again until the list stops changing.
Scenario:
I have three types of Op case classes. Pop() extends Op, Push() extends Op and Nop() extends Op. I want to replace the occurrence of Push(), Pop() with Nop(). Basically the code could look like seq.replace(Push() ~ Pop() ~> Nop()).
Problem:
Now that I call seq.replace(...) I will have to search in the sequence for an occurrence of Push(), Pop(). So far so good. I find the occurrence. But now I will have to splice the occurrence from the list and insert the replacement.
Now there are two options. My list could be mutable or immutable. If I use an immutable list I am scared regarding performance because those sequences are usually 500+ elements in size. If I replace a lot of occurrences like A ~ B ~ C ~> D ~ E I will create a lot of new objects, if I am not mistaken. However, I could also use a mutable sequence like ListBuffer[Op].
Basically from a linked-list background I would just do some pointer-bending and after a total of four operations I am done with the replacement without creating new objects. That is why I am now concerned about performance. Especially since this is a performance-critical operation for me.
Question:
How would you implement the replace() method in a Scala fashion and what kind of data structure would you use keeping in mind that this is a performance-critical operation?
I am happy with answers that point me in the right direction or pseudo code. No need to write a full replace method.
Thank you.
Ok, some considerations to be made. First, recall that, on lists, tail does not create objects, and prepending (::) only creates one object for each prepended element. That's pretty much as good as you can get, generally speaking.
One way of doing this would be this:
def myReplace(input: List[Op], pattern: List[Op], replacement: List[Op]) = {
// This function should be part of an KMP algorithm instead, for performance
def compare(pattern: List[Op], list: List[Op]): Boolean = (pattern, list) match {
case (x :: xs, y :: ys) if x == y => compare(xs, ys)
case (Nil, _) => true // pattern exhausted: it matched a prefix of list
case _ => false
}
var processed: List[Op] = Nil
var unprocessed: List[Op] = input
val patternLength = pattern.length
val reversedReplacement = replacement.reverse
// Do this until we finish processing the whole sequence
while (unprocessed.nonEmpty) {
// This inside algorithm would be better if replaced by KMP
// Quickly process non-matching sequences
while (unprocessed.nonEmpty && unprocessed.head != pattern.head) {
processed ::= unprocessed.head
unprocessed = unprocessed.tail
}
if (unprocessed.nonEmpty) {
if (compare(pattern, unprocessed)) {
processed :::= reversedReplacement
unprocessed = unprocessed drop patternLength
} else {
processed ::= unprocessed.head
unprocessed = unprocessed.tail
}
}
}
processed.reverse
}
You may gain speed by using KMP, particularly if the pattern searched for is long.
Now, what is the problem with this algorithm? The problem is that it won't test if the replaced pattern causes a match before that position. For instance, if I replace ACB with C, and I have an input AACBB, then the result of this algorithm will be ACB instead of C.
To avoid this problem, you should create a backtrack. First, you check at which position in your pattern the replacement may happen:
val positionOfReplacement = pattern.indexOfSlice(replacement)
Then, you modify the replacement part of the algorithm like this:
if (compare(pattern, unprocessed)) {
if (positionOfReplacement > 0) {
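// drop the matched pattern and put the replacement back to be re-scanned
unprocessed = replacement ::: (unprocessed drop patternLength)
// re-queue the most recently processed elements (processed is stored reversed)
unprocessed = (processed take positionOfReplacement).reverse ::: unprocessed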
processed = processed drop positionOfReplacement
} else {
processed :::= reversedReplacement
unprocessed = unprocessed drop patternLength
}
} else {
This will backtrack enough to solve the problem.
This algorithm won't deal efficiently, however, with multiple patterns at the same time, which I guess is where you are going. For that, you'll probably need some adaptation of KMP, to do it efficiently, or, otherwise, use a DFA to control possible matchings. It gets even worse if you want to match both AB and ABC.
In practice, the full-blown problem is equivalent to regex match & replace, where the replace is a function of the match. Which means, of course, you may want to start looking into regex algorithms.
EDIT
I was forgetting to complete my reasoning. If that technique doesn't work for some reason, then my advice is going with an immutable tree-based vector. Tree-based vectors enable replacement of partial sequences with low amount of copying.
And if that doesn't do, then the solution is doubly linked lists. And pick one from a library with slice replacement -- otherwise you may end up spending way too much time debugging a known but tricky algorithm.

Resources