The best string reconstruction algorithm out there? (Best as in 'most accurate')

I've been searching for and testing all kinds of string reconstruction algorithms, i.e. reconstructing spaceless text into normal text.
My result, posted here: Solution working partially in Ruby, achieves about 90% reconstruction for 2- or 3-word sentences with a complete dictionary. But I can't get it to run better than this!
I think my algorithm, inspired by dynamic programming, is bad and contains a lot of patchwork.
Can you propose another algorithm (in pseudo-code) that would work foolproof with a complete dictionary?

You need more than just a dictionary, because you can have multiple possible phrases from the same spaceless string. For example, "themessobig" could be "the mess so big" or "themes so big" or "the mes so big", etc.
Those are all valid possibilities, but some are far more likely than others. Thus what you want to do is pick the most likely one given how the language is actually used. For this you need a huge corpus of text along with some NLP algorithms. Probably the simplest one is to count how likely a word is to occur after another word. So for "the mess so big", its likelihood would be:
P(the | <START>) * P(mess | the) * P(so | mess) * P(big | so)
For "themes so big", the likelihood would be:
P(themes | <START>) * P(so | themes) * P(big | so)
Then you can pick the most likely of the possibilities. You can also use triples instead of pairs (e.g. P(so | the + mess)), which will require a bigger corpus to be effective.
This won't be foolproof, but you can get better and better at it by having better corpora or tweaking the algorithm.
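In Python, a minimal sketch of that scoring, assuming you have already collected unigram_counts and bigram_counts from a corpus (both names are placeholders here, and the add-one smoothing is the crudest available; real systems sum log-probabilities to avoid underflow):

def bigram_prob(word, prev, unigram_counts, bigram_counts):
    # P(word | prev) with add-one smoothing so unseen pairs don't zero out.
    vocab_size = len(unigram_counts)
    return (bigram_counts.get((prev, word), 0) + 1) / (unigram_counts.get(prev, 0) + vocab_size)

def likelihood(words, unigram_counts, bigram_counts):
    # P(w1 | <START>) * P(w2 | w1) * ... as described above.
    prob, prev = 1.0, "<START>"
    for w in words:
        prob *= bigram_prob(w, prev, unigram_counts, bigram_counts)
        prev = w
    return prob

# Pick the most likely of the candidate segmentations:
# best = max(candidates, key=lambda words: likelihood(words, uni, bi))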

With a unigram language model, which is essentially word frequencies, it is possible to find the most probable segmentation of a string.
There is example code in Russell & Norvig (2003, p. 837); look for the function viterbi_segment.
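A rough Python sketch in the same spirit (not the book's code; word_prob is a placeholder for a relative-frequency lookup that should return some small penalty value for unknown words):

from functools import lru_cache

def segment(text, word_prob, max_len=20):
    # Most probable segmentation under a unigram model, via memoized DP.
    @lru_cache(maxsize=None)
    def best(s):
        # Returns (probability, words) for the best split of s.
        if not s:
            return (1.0, [])
        candidates = []
        for i in range(1, min(len(s), max_len) + 1):
            head, tail = s[:i], s[i:]
            p_tail, words = best(tail)
            candidates.append((word_prob(head) * p_tail, [head] + words))
        return max(candidates, key=lambda c: c[0])
    return best(text)[1]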

What algorithms can group characters into words?

I have some text generated by some lousy OCR software.
The output contains a mixture of words and space-separated characters that should have been grouped into words. For example,
Expr e s s i o n Syntax
S u m m a r y o f T e r minology
should have been
Expression Syntax
Summary of Terminology
What algorithms can group characters into words?
If I program in Python, C#, Java, C or C++, what libraries provide the implementation of the algorithms?
Thanks.
Minimal approach:
In your input, remove the space before any single-letter word. Mark the words created this way somehow (prefix them with a symbol not in the input, for example).
Get a dictionary of English words, sorted longest to shortest.
For each marked word in your input, find the longest match and break that off as a word. Repeat on the characters left over in the original "word" until there's nothing left over. (In the case where there's no match just leave it alone.)
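A quick Python sketch of those steps, where words stands in for a real English word list you would load yourself:

def regroup(marked, words):
    # Greedily break the longest known prefix off a run of joined letters.
    by_length = sorted(words, key=len, reverse=True)
    out, rest = [], marked
    while rest:
        match = next((w for w in by_length if rest.startswith(w)), None)
        if match is None:
            out.append(rest)   # no dictionary hit: leave it alone
            break
        out.append(match)
        rest = rest[len(match):]
    return " ".join(out)

# regroup("expression", {"expression", "express", "ion"}) -> "expression"
# (longest-first order is what stops "express" + "ion" from winning here)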
More sophisticated, overkill approach:
The problem of splitting words without spaces is a real-world problem in languages commonly written without spaces, such as Chinese and Japanese. I'm familiar with Japanese so I'll mainly speak with reference to that.
Typical approaches use a dictionary and a sequence model. The model is trained to learn transition properties between labels - part of speech tagging, combined with the dictionary, is used to figure out the relative likelihood of different potential places to split words. Then the most likely sequence of splits for a whole sentence is solved for using (for example) the Viterbi algorithm.
Creating a system like this is almost certainly overkill if you're just cleaning OCR data, but if you're interested it may be worth looking into.
A sample case where the more sophisticated approach will work and the simple one won't:
input: Playforthefunofit
simple output: Play forth efunofit (forth is longer than for)
sophisticated output: Play for the fun of it (forth efunofit is a low-frequency - that is, unnatural - transition, while for the is not)
You can work around the issue with the simple approach to some extent by adding common short-word sequences to your dictionary as units. For example, add forthe as a dictionary word, and split it in a post-processing step.
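For instance, a tiny sketch of that post-processing step, with a hypothetical (not exhaustive) compound table:

# These compound entries would also need to be in the segmentation dictionary.
compounds = {"forthe": "for the", "ofit": "of it", "tothe": "to the"}

def post_split(words):
    return " ".join(compounds.get(w, w) for w in words)

# post_split(["Play", "forthe", "fun", "ofit"]) -> "Play for the fun of it"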
Hope that helps - good luck!

What is the relationship between LCS and string similarity?

I wanted to know how similar two strings were, and I found a tool on the following page:
https://www.tools4noobs.com/online_tools/string_similarity/
and it says that this tool is based on the article:
"An O(ND) Difference Algorithm and its Variations"
available on:
http://www.xmailserver.org/diff2.pdf
I have read the article, but I have some doubts about how they programmed that tool. For example, the authors say that it is based on the GNU diff C library and analyze.c; maybe that refers to this:
https://www.gnu.org/software/diffutils/
and this:
https://github.com/masukomi/dwdiff-annotated/blob/master/src/diff/analyze.c
The problem I have is understanding the relation to the article. From what I read, the article presents an algorithm for finding the LCS (longest common subsequence) of a pair of strings, using a modification of the dynamic programming algorithm normally used to solve this problem. The modification is the use of a shortest-path algorithm to find the LCS that implies the minimum number of modifications.
At this point I am lost, because I do not know how the authors of the tool I first mentioned used the LCS to measure how similar two sequences are. Also, they have put in a limit value of 0.4; what does that mean? Can anybody help me with this, or have I misunderstood the article?
Thanks
I think the description on the string similarity tool is not being entirely honest, because I'm pretty sure it has been implemented using the Perl module String::Similarity. The similarity score is normalised to a value between 0 and 1, and as the module page describes, the limit value can be used to abort the comparison early if the similarity falls below it.
If you download the Perl module and expand it, you can read the C source of the algorithm, in the file called fstrcmp.c, which says that it is "Derived from GNU diff 2.7, analyze.c et al.".
The connection between the LCS and string similarity is simply that those characters that are not in the LCS are precisely the characters you would need to add, delete or substitute in order to convert the first string to the second, and the number of these differing characters is usually used as the difference score, as in the Levenshtein Distance.
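To make that connection concrete, here is a Python sketch using the common 2*|LCS| / (|a|+|b|) normalisation; the tool may well use a different exact formula internally:

def lcs_length(a, b):
    # Classic dynamic-programming LCS table.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def similarity(a, b):
    # Characters outside the LCS are exactly the ones you would have to
    # add, delete or substitute, so this is 1.0 for identical strings.
    return 2 * lcs_length(a, b) / (len(a) + len(b)) if (a or b) else 1.0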

Diffing many texts against each other to derive template and data (finding common subsequences)

Suppose there are many texts that are known to be made from a single template (for example, many HTML pages, rendered from a template backed by data from some sort of database). A very simple example:
id:937 name=alice;
id:28 name=bob;
id:925931 name=charlie;
Given only these 3 texts, I'd like to get original template that looks like this:
"id:" + $1 + " name=" + $2 + ";"
and 3 sets of strings that were used with this template:
$1 = 937, $2 = alice
$1 = 28, $2 = bob
$1 = 925931, $2 = charlie
In other words, the "template" is a list of the common subsequences encountered in all given texts, always in a certain order, and everything else except these subsequences should be considered "data".
I guess the general algorithm would be very similar to any LCS (longest common subsequence) algorithm, albeit with different backtracking code, that would somehow separate "template" (characters common for all given texts) and "data strings" (different characters).
Bonus question: are there ready-made solutions to do so?
I agree with the comments about the question being ill-defined. It seems likely that the format is much more specific than your general question indicates.
Having said that, something like RecordBreaker might be a help. You could also Google "wrapper induction" to see if you find some useful leads.
Perform a global multiple sequence alignment, and then call every resulting column that has a constant value part of the template:
id:   937 name=alice  ;
id:    28 name=bob    ;
id:925931 name=charlie;
XXX      XXXXXX       X   (inferred template: X marks a constant column)
Most tools that I'm aware of for multiple sequence alignment require smaller alphabets -- DNA or protein -- but hopefully you can find a tool that works on the alphabet you're using (which presumably is at least all printable ASCII characters). In the worst case, you can of course implement the DP yourself: to align 2 sequences (strings) globally you use the Needleman-Wunsch algorithm, while for more than two sequences there are several approaches, the most common being sum-of-pairs scoring. The exact algorithm for k > 2 sequences unfortunately takes time exponential in k, but the heuristics employed in bioinformatics tools such as MUSCLE are much faster, and produce alignments that are very nearly as good. If they can be persuaded to work with the alphabet you're using, they would be the natural choice.
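If a full MSA tool turns out to be overkill, here is a rough Python approximation using the standard library's difflib; it intersects the matching blocks of the first text against all the others, which is cruder than a true alignment but often enough for template-ish data:

from difflib import SequenceMatcher

def infer_template(texts):
    base = texts[0]
    common = [True] * len(base)
    for other in texts[1:]:
        matched = [False] * len(base)
        for block in SequenceMatcher(None, base, other).get_matching_blocks():
            for i in range(block.a, block.a + block.size):
                matched[i] = True
        common = [c and m for c, m in zip(common, matched)]
    # Keep template characters, mark data positions with $.
    return "".join(ch if keep else "$" for ch, keep in zip(base, common))

# infer_template(["id:937 name=alice;", "id:28 name=bob;", "id:925931 name=charlie;"])
# yields something like "id:$$$ name=$$$$$;" (greedy matching can vary).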

How to find "equivalent" texts?

I want to find (not generate) 2 text strings such that, after removing all non-letters and uppercasing, one string can be translated to the other by simple substitution.
The motivation for this comes from a project I know of that is testing methods for attacking ciphers via probability distributions. I'd like to find a large, coherent plain text that, once encrypted with a simple substitution cipher, can be decrypted to something else that is also coherent.
This ends up as 2 parts: find the longest such strings in a corpus, and get that corpus.
The first part seems to me to be amenable to some sort of attack with a B-tree keyed off the string after a substitution that relabels characters in order of first occurrence:
HELLOWORLDTHISISIT
1233454637819a9a98
A little optimization is possible based on knowing the maximum value and length of the string at each depth of the tree; the rest is just coding.
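A Python sketch of that normalisation; the label alphabet is an arbitrary choice that matches the example above (digits, then letters), and this sketch only handles inputs with at most 35 distinct characters:

def signature(text):
    # Relabel characters in order of first occurrence, so strings that are
    # substitution-equivalent get identical signatures.
    labels, out = {}, []
    for ch in text:
        if ch not in labels:
            labels[ch] = "123456789abcdefghijklmnopqrstuvwxyz"[len(labels)]
        out.append(labels[ch])
    return "".join(out)

# signature("HELLOWORLDTHISISIT") -> "1233454637819a9a98"
# signature("HELLO") == signature("MESSY")   # both "12334": equivalent under substitution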
The other part would be quite a bit more involved: how to generate a large corpus of text to search? Some kind of internet spider would seem to be the ideal approach, as it would have access to the largest amount of text, but how do you strip it down to just the text?
The question is: any ideas on how to do this better?
Edit: the cipher that was being used is an insanely basic 26 letter substitution cipher.
p.s. this is more a thought experiment than a probable real project for me.
There are 26! different substitution ciphers. That works out to a bit over 88 bits of choice:
>>> import math
>>> math.log(math.factorial(26), 2)
88.381953327016262
The entropy of English text is something like 2 bits per character at least. So, with roughly 88 bits of key to absorb at about 2 bits per character, it seems to me you can't reasonably expect to find passages of more than 45-50 characters that are accidentally equivalent under substitution.
For the large corpus, there's the Gutenberg Project and Wikipedia, for a start. You can download a dump of all the English Wikipedia's XML files from their website.
I think you're asking a bit much to generate a substitution that is also "coherent". Figuring out which text is coherent is an AI problem in itself. Also, the longer your text is, the more complicated it will be to create a "coherent" result, quickly approaching a point where you need a "key" as long as the text you are encrypting, thus defeating the purpose of encrypting it at all.

Looking for algorithm that reverses the sprintf() function output

I am working on a project that requires the parsing of log files. I am looking for a fast algorithm that would take groups of messages like this:
The temperature at P1 is 35F.
The temperature at P1 is 40F.
The temperature at P3 is 35F.
Logger stopped.
Logger started.
The temperature at P1 is 40F.
and puts out something in the form of a printf():
"The temperature at P%d is %dF.", Int1, Int2"
{(1,35), (1, 40), (3, 35), (1,40)}
The algorithm needs to be generic enough to recognize almost any data load in message groups.
I tried searching for this kind of technology, but I don't even know the correct terms to search for.
I think you might be overlooking fscanf() and sscanf(), which are the opposites of fprintf() and sprintf().
Overview:
A naïve algorithm keeps track of the frequency of words in a per-column manner, where one can assume that each line can be separated into columns by a delimiter.
Example input:
The dog jumped over the moon
The cat jumped over the moon
The moon jumped over the moon
The car jumped over the moon
Frequencies:
Column 1: {The: 4}
Column 2: {car: 1, cat: 1, dog: 1, moon: 1}
Column 3: {jumped: 4}
Column 4: {over: 4}
Column 5: {the: 4}
Column 6: {moon: 4}
We could partition these frequency lists further by grouping based on the total number of fields, but in this simple and convenient example, we are only working with a fixed number of fields (6).
The next step is to iterate through the lines that generated these frequency lists, so let's take the first one.
1. The: meets some hand-wavy criteria and the algorithm decides it must be static.
2. dog: doesn't appear to be static based on the rest of the frequency list, and thus it must be dynamic as opposed to static text. We loop through a few pre-defined regular expressions and come up with /[a-z]+/i.
3. jumped: same deal as #1; it's static, so leave as is.
4. over: same deal as #1; it's static, so leave as is.
5. the: same deal as #1; it's static, so leave as is.
6. moon: same deal as #1; it's static, so leave as is.
Thus, just from going over the first line we can put together the following regular expression:
/The ([a-z]+?) jumped over the moon/
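As a rough Python sketch of the walk-through above (the 0.5 "static" threshold is exactly the hand-waving mentioned, and a fixed field count per line is assumed):

from collections import Counter
import re

def infer_pattern(lines, static_ratio=0.5):
    rows = [line.split() for line in lines]
    columns = list(zip(*rows))              # assumes a fixed number of fields
    parts = []
    for col in columns:
        token, count = Counter(col).most_common(1)[0]
        if count / len(col) >= static_ratio:
            parts.append(re.escape(token))  # static text
        else:
            parts.append(r"([a-z]+)")       # dynamic field (toy sub-pattern)
    return re.compile(" ".join(parts), re.I)

lines = ["The dog jumped over the moon", "The cat jumped over the moon",
         "The moon jumped over the moon", "The car jumped over the moon"]
# infer_pattern(lines) compiles to the equivalent of /The ([a-z]+) jumped over the moon/i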
Considerations:
Obviously, one can choose to scan part of the document or the whole of it for the first pass, as long as one is confident the frequency lists will be a sufficient sampling of the entire data.
False positives may creep into the results, and it will be up to the filtering algorithm (hand-waving) to provide the best threshold between static and dynamic fields, or some human post-processing.
The overall idea is probably a good one, but the actual implementation will determine the speed and efficiency of this algorithm.
Thanks for all the great suggestions.
Chris is right. I am looking for a generic solution for normalizing any kind of text. The solution of the problem boils down to dynamically finding patterns in two or more similar strings.
Almost like predicting the next element in a set, based on the previous two:
1: Everest is 30000 feet high
2: K2 is 28000 feet high
=> What is the pattern?
=> Answer:
[name] is [number] feet high
Now the text file can have millions of lines and thousands of patterns. I would like to parse the files very, very fast, find the patterns and collect the data sets that are associated with each pattern.
I thought about creating some high level semantic hashes to represent the patterns in the message strings.
I would use a tokenizer and give each of the token types a specific "weight".
Then I would group the hashes and rate their similarity. Once the grouping is done I would collect the data sets.
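To illustrate, a Python sketch of one possible token-type hash; the type set and regular expressions here are guesses at what such a tokenizer might produce:

import re

def type_hash(line):
    # Map each token to a coarse type; the resulting string is the grouping key.
    types = []
    for tok in line.split():
        if re.fullmatch(r"\d+", tok):
            types.append("NUM")
        elif re.fullmatch(r"[A-Za-z]+\d+", tok):
            types.append("ID")
        else:
            types.append("WORD")
    return " ".join(types)

# type_hash("Everest is 30000 feet high") -> "WORD WORD NUM WORD WORD"
# type_hash("K2 is 28000 feet high")      -> "ID WORD NUM WORD WORD"
# The two hashes are close but not identical, which is why rating hash
# similarity (rather than requiring equality) matters for the grouping.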
I was hoping, that I didn't have to reinvent the wheel and could reuse something that is already out there.
Klaus
It depends on what you are trying to do. If your goal is to quickly generate sprintf() input, this works. If you are trying to parse data, maybe regular expressions would do too.
You're not going to find a tool that can simply take arbitrary input, guess what data you want from it, and produce the output you want. That sounds like strong AI to me.
Producing something like this, even just to recognize numbers, gets really hairy. For example, is "123.456" one number or two? How about "123,456"? Is "35F" a decimal number followed by an 'F', or is it the hex value 0x35F? You're going to have to build something that parses in the way you need. You can do this with regular expressions, or with sscanf, or some other way, but you're going to have to write something custom.
However, with basic regular expressions, you can do this yourself. It won't be magic, but it's not that much work. Something like this will parse the lines you're interested in and consolidate them (Perl):
# Collect the values from every matching line, then print one
# consolidated format string followed by the data sets.
my @vals = ();
while (defined(my $line = <>))
{
    if ($line =~ /The temperature at P(\d+) is (\d+)F\./)
    {
        push(@vals, "($1,$2)");
    }
}
print "The temperature at P%d is %dF. {";
for (my $i = 0; $i < @vals; $i++)
{
    print $vals[$i];
    if ($i < @vals - 1)
    {
        print ",";
    }
}
print "}\n";
The output from this is:
The temperature at P%d is %dF. {(1,35),(1,40),(3,35),(1,40)}
You could do something similar for each type of line you need to parse. You could even read these regular expressions from a file, instead of custom coding each one.
I don't know of any specific tool to do that. What I did when I had a similar problem to solve was to guess regular expressions that matched lines.
I then processed the files and displayed only the unmatched lines. If a line is unmatched, it means that the pattern is wrong and should be tweaked, or that another pattern should be added.
After around an hour of work, I succeeded in finding the ~20 patterns to match 10000+ lines.
In your case, you can first "guess" that one pattern is "The temperature at P[1-3] is [0-9]{2}F.". If you reprocess the file, removing every matched line, it leaves only:
Logger stopped.
Logger started.
Which you can then match with "Logger (.+).".
You can then refine the patterns and find new ones to match your whole log.
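For example, a tiny Python filter along these lines (the two patterns are just the ones guessed above):

import re, sys

patterns = [re.compile(r"The temperature at P[1-3] is [0-9]{2}F\."),
            re.compile(r"Logger (.+)\.")]

for line in sys.stdin:
    if not any(p.search(line) for p in patterns):
        print(line, end="")   # unmatched: add or tweak a pattern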
@John: I think that the question relates to an algorithm that actually recognises patterns in log files and automatically "guesses" appropriate format strings and data for them. The *scanf family can't do that on its own; it can only be of help once the patterns have been recognised in the first place.
@Derek Park: Well, even a strong AI couldn't be sure it had the right answer.
Perhaps some compression-like mechanism could be used:
Find large, frequent substrings
Find large, frequent substring patterns. (i.e. [pattern:1] [junk] [pattern:2])
Another item to consider might be to group lines by edit-distance. Grouping similar lines should split the problem into one-pattern-per-group chunks.
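A Python sketch of that grouping, using difflib's ratio as a cheap stand-in for a true edit-distance metric; the 0.8 threshold is a guess you would tune:

from difflib import SequenceMatcher

def group_lines(lines, threshold=0.8):
    # Each line joins the first existing group it resembles, else starts a new one.
    groups = []
    for line in lines:
        for group in groups:
            if SequenceMatcher(None, group[0], line).ratio() >= threshold:
                group.append(line)
                break
        else:
            groups.append([line])
    return groups

Each resulting group should then be roughly one pattern per group, ready for the per-column frequency treatment described above.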
Actually, if you manage to write this, let the whole world know, I think a lot of us would like this tool!
@Anders
Well, even a strong AI couldn't be sure it had the right answer.
I was thinking that sufficiently strong AI could usually figure out the right answer from the context. e.g. Strong AI could recognize that "35F" in this context is a temperature and not a hex number. There are definitely cases where even strong AI would be unable to answer. Those are the same cases where a human would be unable to answer, though (assuming very strong AI).
Of course, it doesn't really matter, since we don't have strong AI. :)
http://www.logparser.com forwards to an IIS forum which seems fairly active. This is the official site for Gabriele Giuseppini's "Log Parser Toolkit". While I have never actually used this tool, I did pick up a cheap copy of the book from Amazon Marketplace - today a copy is as low as $16. Nothing beats a dead-tree interface for just flipping through pages.
Glancing at this forum, I had not previously heard about the "New GUI tool for MS Log Parser, Log Parser Lizard" at http://www.lizardl.com/.
The key issue of course is the complexity of your GRAMMAR. To use any kind of log-parser as the term is commonly used, you need to know exactly what you're scanning for, you can write a BNF for it. Many years ago I took a course based on Aho-and-Ullman's "Dragon Book", and the thoroughly understood LALR technology can give you optimal speed, provided of course that you have that CFG.
On the other hand it does seem you're possibly reaching for something AI-like, which is a different order of complexity entirely.
