Can someone please point me to tutorials on "token suffix trees"?
From googling that same phrase and scanning the first couple of results, my guess is that they are talking about a suffix tree in which the "letters" (or "characters", or "elements") are not individual ASCII or Unicode characters as we are accustomed to, but rather the lexical tokens of some programming language.
So e.g. for C you would have a "letter" called int, another letter called (, and so on. I'm not sure exactly how tokens that are prefixes of other tokens (e.g. + is a prefix of ++) would be handled, but my guess is that they are handled the same way the lexer deals with them, which is (for C at least) by always greedily building the longest token, so e.g. the 5 input characters +++++ will be lexed as ++, ++, +.
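To make the greedy rule concrete, here is a minimal sketch (my own illustration, in Python, over a toy subset of C tokens); a token suffix tree would then be built over the resulting token sequence rather than over individual characters:

# A toy maximal-munch lexer: always consume the longest token that
# matches at the current position, so "+++++" becomes ++, ++, +.
TOKENS = ["int", "(", ")", "++", "+"]   # toy subset of C

def lex(source):
    tokens = []
    i = 0
    while i < len(source):
        for tok in sorted(TOKENS, key=len, reverse=True):  # longest first
            if source.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError("no token matches at position %d" % i)
    return tokens

print(lex("+++++"))   # ['++', '++', '+']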
Not sure if it is what you are looking for, but your question reminds me of what I know as 'suffix trees on words', e.g. http://www.larsson.dogma.net/words-alg.pdf
Related
I have some text generated by some lousy OCR software.
The output contains a mixture of words and space-separated characters that should have been grouped into words. For example,
Expr e s s i o n Syntax
S u m m a r y o f T e r minology
should have been
Expression Syntax
Summary of Terminology
What algorithms can group characters into words?
If I program in Python, C#, Java, C or C++, which libraries provide implementations of these algorithms?
Thanks.
Minimal approach:
1) In your input, remove the space before any single-letter word. Mark the final words created this way somehow (prefix them with a symbol not found in the input, for example).
2) Get a dictionary of English words, sorted longest to shortest.
3) For each marked word in your input, find the longest dictionary match and break that off as a word. Repeat on the characters left over in the original "word" until nothing remains. (Where there's no match, just leave it alone.)
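Here is a minimal sketch of step 3 (my own illustration, in Python; WORDS stands in for whatever dictionary you load, and must be tried longest first):

# Greedy longest-match split: repeatedly break the longest dictionary
# word off the front of a marked "word"; if nothing matches, keep the
# remainder as-is.
WORDS = sorted(["expression", "syntax", "summary", "of", "terminology"],
               key=len, reverse=True)

def greedy_split(chunk):
    parts = []
    while chunk:
        match = next((w for w in WORDS if chunk.startswith(w)), None)
        if match is None:
            parts.append(chunk)   # no match: leave it alone
            break
        parts.append(match)
        chunk = chunk[len(match):]
    return parts

print(greedy_split("summaryofterminology"))   # ['summary', 'of', 'terminology']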
More sophisticated, overkill approach:
The problem of splitting words without spaces is a real-world problem in languages commonly written without spaces, such as Chinese and Japanese. I'm familiar with Japanese so I'll mainly speak with reference to that.
Typical approaches use a dictionary and a sequence model. The model is trained to learn transition properties between labels - part-of-speech tagging, combined with the dictionary, is used to figure out the relative likelihood of different potential places to split words. Then the most likely sequence of splits for a whole sentence is solved for using (for example) the Viterbi algorithm.
Creating a system like this is almost certainly overkill if you're just cleaning OCR data, but if you're interested it may be worth looking into.
A sample case where the more sophisticated approach will work and the simple one won't:
input: Playforthefunofit
simple output: Play forth efunofit (forth is longer than for)
sophisticated output: Play for the fun of it (forth efunofit is a low-frequency - that is, unnatural - transition, while for the is not)
You can work around the issue with the simple approach to some extent by adding common short-word sequences to your dictionary as units. For example, add forthe as a dictionary word, and split it in a post processing step.
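To give a feel for the sequence-model idea, here is a toy sketch (my own; a real system would use trained frequencies and part-of-speech transitions rather than this hand-made table). It picks the split with the lowest total word cost by dynamic programming, the same shape of computation as Viterbi:

# Toy Viterbi-style segmentation: a word's cost is roughly
# -log(frequency), so common words are cheap and unknown fragments are
# expensive; "playforthefunofit" then splits as play/for/the/fun/of/it
# rather than play/forth/efunofit.
import math

FREQ = {"play": 50, "for": 900, "the": 1000, "fun": 40,
        "of": 950, "it": 800, "forth": 5}
COST = {w: -math.log(f / 1000.0) for w, f in FREQ.items()}
UNKNOWN = 20.0                 # heavy penalty for an out-of-dictionary chunk

def segment(text):
    n = len(text)
    best = [(0.0, [])] + [(math.inf, [])] * n   # best[i]: (cost, words) for text[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - 12), i):      # cap candidate words at 12 chars
            word = text[j:i]
            cost = best[j][0] + COST.get(word, UNKNOWN)
            if cost < best[i][0]:
                best[i] = (cost, best[j][1] + [word])
    return best[n][1]

print(segment("playforthefunofit"))   # ['play', 'for', 'the', 'fun', 'of', 'it']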
Hope that helps - good luck!
I have a system that is confined to two alphanumeric characters. Some simple math shows that we get 1,296 combinations if we use all possible two-character strings over 0-9 and a-z. Lower case letters cannot be distinguished from upper case, and special characters (including the blank character) cannot be used.
Is there any creative mapping, perhaps to an external reference, that could take this two-character field significantly beyond 1,296 combinations?
Examples of identifiers would be 00, OO, AZ, Z4, etc.
Thanks!
I'm afraid not, no more than you could get a 3-bit number to represent more than 8 different values. If you're interested in the details, you can look up information theory or Kolmogorov complexity. Essentially, with only 1,296 combinations you can only label 1,296 distinct pieces of information.
As an example, consider if you had 1,297 things. All of those two-letter combinations would be used up by the first 1,296, so what combination would be associated with the next one? It would have to be a repeat of one you had already used.
Shannon also has some good material on this, and the implications of that sort of thing form the basis for a lot of file compression systems.
You could maybe squeeze out one more combination if you cheat and allow a 'null' value to represent an extra possibility, but that's not really in the spirit of the question.
If you are restricted to two characters taken from an alphabet of 36, then you are limited to 36² distinct symbols, that's it.
More context is required to find workarounds, like stealing bits elsewhere, using symbols in pairs, breaking the case limitation, or exploiting the history of transactions...
The precise meaning of "a system that is confined to two alphanumeric characters" needs to be known to be able to suggest a workaround. Is that a space constraint? Do you need the restriction to 2 chars for efficiency? Does it need to work with other code that accepts or generates 2 char indexes?
If you have up to 1295 identifiers that are used often, and some others that occur only occasionally, you could choose one identifier, e.g. "ZZ", to indicate that another identifier follows. So "00" through to "ZY" would be 1295 simple 2-char identifiers, and "ZZ00" through to "ZZZZ" would be a further 1296 combined 4-char identifiers. (Or "ZZ0000" through to "ZZZZZZ" for a further 1296*1296 identifiers ...)
This could work for space constraints. For efficiency, it depends on whether the additional check to see if the identifier is "ZZ" is too expensive or not.
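A quick sketch of that escape scheme (my own illustration, in Python, with "ZZ" as the escape prefix as described above):

import itertools

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
ESCAPE = "ZZ"

def identifiers():
    # "00".."ZY": 1295 plain ids ("ZZ" is reserved as the escape),
    # then "ZZ00".."ZZZZ": 1296 further 4-char ids.
    pairs = ["".join(p) for p in itertools.product(ALPHABET, repeat=2)]
    for code in pairs:
        if code != ESCAPE:
            yield code
    for code in pairs:
        yield ESCAPE + code

ids = list(identifiers())
print(len(ids), ids[0], ids[1294], ids[1295], ids[-1])
# 2591 00 ZY ZZ00 ZZZZ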
I'm writing a program that performs word declension for the Polish language. In Polish, stems can vary in some cases (because of palatalization, mobile/fleeting e, and other effects).
For example, we have the word "karzeł", which is its basic dictionary form. Its stem is also 'karzeł'. But the genitive form of this word is "karła", and its stem is "karł". We can see here that the 'e' disappeared and 'rz' changed to 'r'.
Another example:
'uzda' -> stem 'uzd'
'uździe' -> stem 'uździ'
Alternation: 'zd' -> 'ździ'
I'd like to store only the basic form of the stem in the dictionary ('karzeł' and 'uzd'), so that when my program is given the stem 'karł' or 'uździ' it will find the proper basic stem. Alternations take place only at the end of the stem and affect at most its last 4 letters.
Are there any algorithms that could do that? Levenshtein distance treats all letters equally, so if I type the word 'barzeł', the distance to the stem 'karzeł' will be less than to the stem 'karł'.
I also thought about neural networks, but I'm not sure how to encode the words (give each stem variation a different id?).
Another idea is to write an algorithm that applies the alternations in reverse, creating a set of possible basic stems, and tries to find them in the dictionary.
I'd like to emphasize that I only want to store the basic form of the stem and compute everything else on the fly.
First of all, I remember seeing a number of projects on Polish morphology around. So I would look at them first, before starting one of your own.
Regarding Levenshtein, as Pierre correctly noted in the comment, the distance function can be customized, and it should be. Let me put it this way: think of Levenshtein not as an algorithm in and of itself, but as a solution to a specific error model. The model says that when you are typing a word, every letter can be either dropped or replaced by another one due to some random process (fingers not pressing the right keys). The algorithm is then just a generator of maximum-likelihood solutions under this model: the more errors you allow, the smaller the probability that this particular sequence of errors actually happened, and the higher the score.
You (implicitly) state a very different hypothesis, though: that Polish stems have a certain flexibility at the end (some linguistic process that you do not fully model within this framework). Then, when you strip your suffix (or something that looks like one), there are three options:
1) there is a chance that what you have here is just a different form of a stem you have stored in your dictionary, or
2) it is a completely different stem, or
3) you've stripped your suffix improperly and what you have is not a stem at all.
You can heuristically estimate these probabilities by looking at how many letters in the beginning of the supposed stem match some dictionary entries, for example (how to find these entries is a related but different question). And then you can pick the guess that is the most plausible according to your metric/heuristic.
Now, note that you can use any algorithm to find the candidates in the dictionary, including the Levenshtein algorithm - as long as you are reasonably sure that the right ones will be picked up. But obviously you are better off writing your own dictionary search algorithm that follows your own metric or emulates it, for example by giving the biggest (prohibitive) cost to changes of letters at the beginning of the word and reducing it as you go towards the end.
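For example, here is a sketch of a position-weighted edit distance in that spirit (my own illustration, in Python; the exponential weighting is just one plausible choice). Edits near the beginning of the stored stem are made prohibitively expensive, edits near its end cheap:

def weighted_distance(a, b):
    n, m = len(a), len(b)

    def w(i):
        # cost of touching position i (1-based) of `a`: position 1 costs
        # 2^(n-1), the last position costs 1
        return 2.0 ** (n - i)

    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + w(i)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + w(1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else w(i)
            d[i][j] = min(d[i - 1][j] + w(i),      # delete a[i-1]
                          d[i][j - 1] + w(i),      # insert near position i
                          d[i - 1][j - 1] + sub)   # match or substitute
    return d[n][m]

print(weighted_distance("karzeł", "karł"))    # 6.0: changes sit at the end
print(weighted_distance("karzeł", "barzeł"))  # 32.0: one change right at the start

Under this metric 'karł' comes out much closer to 'karzeł' than 'barzeł' does, which is the behaviour plain Levenshtein fails to give you.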
I'm learning about automata. Can you please help me understand how an automaton with Kleene closure works? Let's say I have the letters a, b, c and I need to match text against an expression containing a Kleene star - like ab*bac - how will it work?
The question seems to be more about how an automaton would handle Kleene closure than what Kleene closure means.
With a simple regular expression, e.g., abc, it's pretty straightforward to design an automaton to recognize it. Each state essentially tells you where you are in the expression so far. State 0 means it's seen nothing yet. State 1 means it's seen a. State 2 means it's seen ab. Etc.
The difficulty with Kleene closure is that a pattern like ab*bc introduces ambiguity. Once the automaton has seen the a and is then faced with a b, it doesn't know whether that b is part of the b* or the literal b that follows it, and it won't know until it reads more symbols--maybe many more.
The simplistic answer is that the automaton simply has a state that literally means it doesn't know yet which path was taken.
In simple cases, you can build this automaton directly. In general cases, you usually build something called a non-deterministic finite automaton. You can either simulate the NDFA, or--if performance is critical--you can apply an algorithm (the subset construction) that converts the NDFA to a deterministic one. The algorithm essentially generates all the ambiguous states for you.
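To make the "set of ambiguous states" idea concrete, here is a minimal sketch (my own illustration, in Python) that simulates a small NDFA for ab*bc by tracking every state that could be active at once:

# NDFA for ab*bc.  In state 1, reading b is ambiguous: it may extend
# b* (stay in 1) or be the literal b (move to 2), so we keep both.
transitions = {                      # state -> {symbol: possible next states}
    0: {'a': {1}},
    1: {'b': {1, 2}},
    2: {'c': {3}},
}
ACCEPT = 3

def matches(text):
    states = {0}                     # start in the initial state only
    for ch in text:
        states = {nxt for s in states
                  for nxt in transitions.get(s, {}).get(ch, set())}
        if not states:               # every possible path has died
            return False
    return ACCEPT in states          # accept if any path reached the end

for s in ["abc", "abbbbc", "abbc", "ac"]:
    print(s, matches(s))             # True, True, True, False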
The Kleene star ('*') means you can have as many occurrences of the preceding element as you want (0 or more).
a* will match any number of a's.
(ab)* will match any number of repetitions of the string "ab".
If you are trying to match an actual asterisk in an expression, the way you write it depends entirely on the syntax of the regex flavor you are working with. In the general case, the backslash \ is used as an escape character:
\* will match an asterisk.
For recognizing a pattern at the end, use concatenation:
(a U b)*c* will match any string that contains 0 or more 'c's at the end, preceded by any number of a's or b's.
For matching text that ends with a Kleene star, again, you can have 0 or more occurrences of the string:
ab(c)* - Possible matches: ab, abc, abcc, abccc, etc.
a(bc)* - Possible matches: a, abc, abcbc, abcbcbc, etc.
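For instance, in Python's re module (one concrete regex flavor; escaping rules differ between flavors):

import re

# The star applies to the element immediately before it.
print(bool(re.fullmatch(r"a(bc)*", "a")))      # True: zero repetitions of "bc"
print(bool(re.fullmatch(r"a(bc)*", "abcbc")))  # True: two repetitions
print(bool(re.fullmatch(r"ab(c)*", "abccc")))  # True: three c's

# An escaped \* matches a literal asterisk.
print(bool(re.search(r"\*", "2 * 3")))         # True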
Your expression ab*bac in English would read something like:
a, followed by 0 or more b's, followed by bac
Strings that would match the regular expression (taken as a whole):
abac
abbbbbbbbbbac
abbac
Strings that would not match:
abaca //added extra literal
bac //missing leading a
As stated in the previous answer, actually searching for a literal * would require an escape character, which is implementation-specific and requires knowledge of your language/library of choice.
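A quick check of the examples above (my own, using Python's re module with whole-string matching):

import re

pattern = re.compile(r"ab*bac")   # a, then zero or more b's, then bac
for s in ["abac", "abbbbbbbbbbac", "abbac", "abaca", "bac"]:
    print(s, bool(pattern.fullmatch(s)))
# abac True, abbbbbbbbbbac True, abbac True, abaca False, bac False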
Arising out of this question, I'm looking for an elegant (ruby) way to compute the word signature suggested in this answer.
The idea suggested is to sort the letters in the word, and also run-length encode repeated letters. So, for example, "mississippi" first becomes "iiiimppssss", and then could be further shortened by encoding it as "4impp4s".
I'm relatively new to Ruby and though I could hack something together, I'm sure this is a one-liner for somebody with more experience of Ruby. I'd be interested to see people's approaches and improve my Ruby knowledge.
edit: to clarify, the performance of computing the signature doesn't much matter for my application. I'm looking to compute the signature so I can store it with each word in a large database of words (450K words), then query for words which have the same signature (i.e. all anagrams of a given word that are actual English words). Hence the focus on space. The 'elegant' part is just to satisfy my curiosity.
The fastest way to create a sorted list of the letters is this:
"mississippi".unpack("c*").sort.pack("c*")
It is quite a bit faster than split('') and join(). For comparison, it is also best to pack the array back together into a String, so you don't have to compare arrays.
I'm not much of a Ruby person either, but as I noted on the other comment this seems to work for the algorithm described.
s = "mississippi"
s.split('').sort.join.gsub(/(.)\1{2,}/) { |s| s.length.to_s + s[0,1] }
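# => "4impp4s"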
Of course, you'll want to make sure the word is lowercase, doesn't contain numbers, etc.
As requested, I'll try to explain the code. Please forgive me if I don't get all of the Ruby or reg ex terminology correct, but here goes.
I think the split/sort/join part is pretty straightforward. The interesting part for me starts at the call to gsub. This will replace a substring that matches the regular expression with the return value from the block that follows it. The reg ex finds any character and creates a backreference. That's the "(.)" part. Then, we continue the matching process using the backreference "\1" that evaluates to whatever character was found by the first part of the match. We want that character to be found a minimum of two more times for a total minimum number of occurrences of three. This is done using the quantifier "{2,}".
If a match is found, the matching substring is then passed to the next block of code as an argument thanks to the "|s|" part. Finally, we use the string equivalent of the matching substring's length and append to it whatever character makes up that substring (they should all be the same) and return the concatenated value. The returned value replaces the original matching substring. The whole process continues until nothing is left to match since it's a global substitution on the original string.
I apologize if that's confusing. As is often the case, it's easier for me to visualize the solution than to explain it clearly.
I don't see an elegant solution. You could use the split message to get the characters into an array, but then once you've sorted the list I don't see a nice linear-time concatenate primitive to get back to a string. I'm surprised.
Incidentally, run-length encoding is almost certainly a waste of time. I'd have to see some very impressive measurements before I'd think it worth considering. If you avoid run-length encoding, you can anagrammatize any string, not just a string of letters. And if you know you have only letters and are trying to save space, you can pack them 5 bits to a letter.
---Irma Vep
EDIT: the other poster found join which I missed. Nice.