How to discard OCR engine output that has no meaning? - algorithm

The Tesseract OCR engine sometimes outputs text that has no meaning. I want to design an algorithm that discards any text or word that has no meaning. Below is the sort of output text that I want to reject. My simple solution is to count the words in the recognized text (separated by " ") and treat text with too many words as garbage (hint: the images I'm scanning contain at most 40 words). Any idea will be helpful, thanks.
wo:>"|axnoA1wvw\
ldflfig
°J!9O‘ !P99W M9N 6 13!-|15!Cl ‘I-/Vl
978 89l9 Z0 3+ 3 'l9.l.
97 999 VLL lLOZ+ 3 9l!q°lN
wo0'|axno/(#|au1e>1e: new;
1=96r2a1ey\1 1uauud0|e/\e(]
|8UJB){ p8UJL|\7'

Divide the output text into words. Divide the words into triples. Count the triple frequencies, and compare to triple frequencies from a known-good text corpus (e.g. all the articles from some mailing list discussing what you intend to OCR, minus the header lines).
When I say "triples", I mean:
whe, hen, i, say, tri, rip, ipl, ple, les, i, mea, ean
...so "i" has a frequency of 2 in this short example, while the others are all frequency 1.
If you do a frequency count of each of these triples for a large document in your intended language, it should become possible to be reasonably accurate in guessing whether a string is in the same language.
Granted, it's heuristic.
I've used a similar approach for detecting English passwords in a password changing program. It worked pretty well, though there's no such thing as a perfect "obvious password rejecter".
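The triple-frequency idea above might be sketched like this; the corpus string here is a tiny stand-in for a real known-good text in your target language:

```python
from collections import Counter

def triples(text):
    """Split text into words and yield each word's overlapping 3-grams
    (words shorter than 3 letters are yielded whole)."""
    for word in text.lower().split():
        word = "".join(c for c in word if c.isalpha())
        if not word:
            continue
        if len(word) < 3:
            yield word
        else:
            for i in range(len(word) - 2):
                yield word[i:i + 3]

def build_profile(corpus_text):
    """Relative frequency of each triple in a known-good corpus."""
    counts = Counter(triples(corpus_text))
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def score(text, profile):
    """Average profile frequency of the text's triples; higher means
    more language-like. Triples never seen in the corpus contribute 0."""
    ts = list(triples(text))
    if not ts:
        return 0.0
    return sum(profile.get(t, 0.0) for t in ts) / len(ts)
```

With a profile built from a large document, real words score well above OCR gibberish, and you can reject strings below a threshold tuned on your data.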

Check the words against a dictionary?
Of course, this will produce false positives for things like foreign phrases or code. The problem is intractable in general (e.g. is this code or gibberish? :) ). The only (nearly) perfect method would be to use this as a heuristic to flag certain sections for human review.
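A minimal sketch of the dictionary check, with a tiny inline word set standing in for a real word list (in practice you would load something like /usr/share/dict/words):

```python
def dictionary_ratio(text, dictionary):
    """Fraction of alphabetic tokens found in a known-word set."""
    words = ["".join(c for c in w.lower() if c.isalpha()) for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)

# Tiny inline word list for illustration only.
WORDS = {"new", "media", "district", "mobile", "kernel", "development", "manager"}

def looks_like_garbage(text, threshold=0.5):
    """Reject text where fewer than `threshold` of the words are known."""
    return dictionary_ratio(text, WORDS) < threshold
```

The threshold is a guess; with a full dictionary you would tune it against labelled good and bad OCR samples.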

What algorithms can group characters into words?

I have some text generated by some lousy OCR software.
The output contains mixture of words and space-separated characters, which should have been grouped into words. For example,
Expr e s s i o n Syntax
S u m m a r y o f T e r minology
should have been
Expression Syntax
Summary of Terminology
What algorithms can group characters into words?
If I program in Python, C#, Java, C or C++, what libraries provide the implementation of the algorithms?
Thanks.
Minimal approach:
In your input, remove the space before any single letter words. Mark the final words created as part of this somehow (prefix them with a symbol not in the input, for example).
Get a dictionary of English words, sorted longest to shortest.
For each marked word in your input, find the longest match and break that off as a word. Repeat on the characters left over in the original "word" until there's nothing left over. (In the case where there's no match just leave it alone.)
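A sketch of the longest-match step, assuming the dictionary is already available as a set of lowercase words; when no complete split exists it returns None, i.e. the "leave it alone" case:

```python
def regroup(marked_word, dictionary):
    """Greedily break the longest dictionary prefix off the front of the
    run of characters, then repeat on what's left over."""
    words_by_len = sorted(dictionary, key=len, reverse=True)
    out = []
    rest = marked_word
    while rest:
        for w in words_by_len:
            if rest.startswith(w):
                out.append(w)
                rest = rest[len(w):]
                break
        else:
            # No dictionary word matches the remainder: give up on this one.
            return None
    return " ".join(out)
```

For example, `regroup("summaryofterminology", ...)` with a suitable dictionary yields "summary of terminology". As the next answer notes, greedy matching can pick the wrong split ("forth" before "for the").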
More sophisticated, overkill approach:
The problem of splitting words without spaces is a real-world problem in languages commonly written without spaces, such as Chinese and Japanese. I'm familiar with Japanese so I'll mainly speak with reference to that.
Typical approaches use a dictionary and a sequence model. The model is trained to learn transition properties between labels - part of speech tagging, combined with the dictionary, is used to figure out the relative likelihood of different potential places to split words. Then the most likely sequence of splits for a whole sentence is solved for using (for example) the Viterbi algorithm.
Creating a system like this is almost certainly overkill if you're just cleaning OCR data, but if you're interested it may be worth looking into.
A sample case where the more sophisticated approach will work and the simple one won't:
input: Playforthefunofit
simple output: Play forth efunofit (forth is longer than for)
sophisticated output: Play for the fun of it (forth efunofit is a low-frequency - that is, unnatural - transition, while for the is not)
You can work around the issue with the simple approach to some extent by adding common short-word sequences to your dictionary as units. For example, add forthe as a dictionary word, and split it in a post processing step.
Hope that helps - good luck!

Detecting "noise" in text extracted from documents

I am working on retrieving the readable content (i.e. text) from PDF documents, most of which are scientific journal articles.
I am using the Poppler text utilities to convert the PDF to text format.
The text is extracted nicely, but unfortunately so are other components of the articles (e.g. numerical tables), which cannot be rendered properly in plain text.
For example, I might get the following output in the middle of the article:
Character distributions random Hmax
1 2 3 4
Organization c) (of characters over species
A
B
A 0 0 0 + C
B + + + +
C + + + + A
B 4+
H Character distributions nonrandom Hobs
Entropy
3+ 2+ 1+
(diversity of characters over species
My question is: how would I identify such "noise" and differentiate it from normal blocks of text? Are there any existing algorithms? I am working in Ruby, but code in any language will help.
You could use a Naive Bayes Classifier to model valid vs. non-valid lines.
Here's an article on one in Ruby; there's a good implementation in Python's nltk.
To set it up you would need to give it examples, for example by filling one file with good lines and one with bad ones. This is the same model used by spam filters.
One trick for this use case is that many basic Naive Bayes Classifiers work using a word-occurrence model for features, whereas here it's not the vocabulary that's significant. You may wish to use line length, percent spaces (rounded to 5% or 10% intervals), or percent of various punctuation marks (rounded, but with higher precision). Hopefully your classifier will learn that "lines with no periods and 30% spaces are bad" or "lines with no punctuation where every word begins with a capital letter are bad".
Based on just your examples above, though, you could probably reject any line with too high a ratio of spaces or those completely lacking in sentence punctuation such as commas and periods.
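The space-ratio and punctuation heuristics from the last paragraph might be sketched like this (the thresholds are guesses, to be tuned on labelled lines from your own documents):

```python
def line_features(line):
    """A few cheap features of a line; digit ratio is another candidate
    feature for a real classifier."""
    n = max(len(line), 1)
    return {
        "space_ratio": line.count(" ") / n,
        "punct": sum(c in ".,;:!?" for c in line),
        "digit_ratio": sum(c.isdigit() for c in line) / n,
    }

def looks_like_noise(line, max_space_ratio=0.3):
    """Reject lines with too many spaces and no sentence punctuation."""
    f = line_features(line)
    return f["space_ratio"] > max_space_ratio and f["punct"] == 0
```

The same features could also be fed to a proper Naive Bayes classifier trained on one file of good lines and one of bad lines, as described above.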

decoding algorithm wanted

I receive encoded PDF files regularly. The encoding works like this:
the PDFs can be displayed correctly in Acrobat Reader
select all and copy the text via Acrobat Reader
and paste it into a text editor
this will show that the content is encoded
so, examples are:
13579 -> 3579;
hello -> jgnnq
it's basically an offset (maybe swap) of ASCII characters.
The question is how can I find the offset automatically when I have access to only a few samples. I cannot be sure whether the encoding offset is changed. All I know is some text will usually (if not always) show up, e.g. "Name:", "Summary:", "Total:", inside the PDF.
Thank you!
edit: thanks for the feedback. I'd try to break the question into smaller questions:
Part 1: How to detect identical part(s) inside string?
You need to brute-force it.
If the pattern is a simple character-code offset, like the +2 shift in your examples:
h i j
e f g
l m n
l m n
o p q
1 2 3
3 4 5
5 6 7
7 8 9
9 : ;
You could easily implement something like this to check against known words:
>>> text = 'jgnnq'
>>> knowns = ['hello', '13579']
>>>
>>> for i in range(-5, 6):  # check char-code shifts from -5 to +5
...     rot = ''.join(chr(ord(j) + i) for j in text)
...     for x in knowns:
...         if x in rot:
...             print(rot)
...
hello
Is the PDF going to contain symbolic (like math or proofs) or natural language text (English, French, etc)?
If the latter, you can use a frequency chart for letters (digraphs, trigraphs and a small dictionary of words if you want to go the distance). I think there are probably a few of these online. Here's a start. And more specifically letter frequencies.
Then, if you're sure it's a Caesar shift, you can grab the first 1000 characters or so and shift them forward by increasing amounts up to (I would guess) 127 or so. Take the resulting texts and calculate how close the frequencies match the average ones you found above. Here is information on that.
The linked letter frequencies page on Wikipedia shows only letters, so you may want to exclude non-letter characters from your calculation, or better, find a chart that includes them. You may also want to convert the entire resulting text to lowercase or uppercase (your preference) so letters are treated the same regardless of case.
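The shift-and-score procedure might be sketched like this; the frequency table is approximate, based on standard published English letter counts:

```python
from collections import Counter

# Approximate relative frequencies of letters in English text (percent).
ENGLISH_FREQ = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
    's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
    'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
    'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.15, 'x': 0.15,
    'q': 0.1, 'z': 0.07,
}

def shift_text(text, k):
    """Shift every character's code point by k (the cipher in the
    question shifts all characters, not just letters)."""
    return "".join(chr(ord(c) + k) for c in text)

def english_score(text):
    """Weight each letter by its expected English frequency; real
    English scores high, shifted gibberish scores low."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    counts = Counter(letters)
    return sum(counts[c] * ENGLISH_FREQ.get(c, 0.0) for c in counts) / len(letters)

def best_shift(ciphertext, max_shift=10):
    """Try every shift in range and keep the most English-looking one."""
    return max(range(-max_shift, max_shift + 1),
               key=lambda k: english_score(shift_text(ciphertext, k)))
```

Given enough ciphertext, the correct shift stands out clearly; for short samples, adding digraph frequencies (as discussed below) makes the scoring more robust.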
Edit - saw comment about character swapping
In this case, it's a substitution cipher, which can still be broken automatically, though this time you will probably want to have a digraph chart handy to do extra analysis. This is useful because there will quite possibly be a substitution that is "closer" to average language in terms of letter analysis than the correct one, but comparing digraph frequencies will let you rule it out.
Also, I suggested shifting the characters, then seeing how close the frequencies matched the average language frequencies. You can actually just calculate the frequencies in your ciphertext first, then try to line them up with the good values. I'm not sure which is better.
Hmmm, that's a tough one.
The only thing I can suggest is using a dictionary (along with some substitution cipher algorithms) may help in decoding some of the text.
But I cannot see a solution that will decode everything for you with the scenario you describe.
Why don't you paste some sample input and we can have a go at decoding it?
It's only possible when you have a lot of examples - enough samples to cover all the combinations, or to reveal a simple linear dependency, or at least to give an idea of the scenario.
This question also has some advice: How would I reverse engineer a cryptographic algorithm?
Do the encoded files open correctly in PDF readers other than Acrobat Reader? If so, you could just use a PDF library (e.g. PDF Clown) and use it to programmatically extract the text you need.

How to find "equivalent" texts?

I want to find (not generate) 2 text strings such that, after removing all non-letters and uppercasing, one string can be translated to the other by simple substitution.
The motivation for this comes from a project I know of that is testing methods for attacking cyphers via probability distributions. I'd like to find a large, coherent plain text that, once encrypted with a simple substitution cypher, can be decrypted to something else that is also coherent.
This ends up as 2 parts, find the longest such strings in a corpus, and get that corpus.
The first part seems to me to be amenable to some sort of attack with a B-tree keyed off the string after a substitution that makes the sequence of first occurrences sequential.
HELLOWORLDTHISISIT
1233454637819a9a98
A little optimization based on knowing the maximum value and length of the string based on each depth of the tree and the rest is just coding.
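A sketch of that canonical form - replace each character with a symbol for the position of its first occurrence, so any two strings related by a substitution get identical keys (the key alphabet here is an assumption, chosen to match the digits-then-letters style of the example above):

```python
def canonical_key(s):
    """Map each distinct character to a symbol in order of first
    appearance. Strings that are letter-substitutions of each other
    produce the same key and can be grouped under it in an index."""
    alphabet = "123456789abcdefghijklmnopq"  # 26 symbols, one per distinct letter
    mapping = {}
    out = []
    for c in s:
        if c not in mapping:
            mapping[c] = alphabet[len(mapping)]
        out.append(mapping[c])
    return "".join(out)
```

Keying a tree or hash table on this value makes finding substitution-equivalent substrings a matter of detecting key collisions.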
The other part would be quite a bit more involved: how to gather a large corpus of text to search? Some kind of internet spider would seem to be the ideal approach, as it would have access to the largest amount of text, but how do you strip it down to just the text?
The question is: any ideas on how to do this better?
Edit: the cipher that was being used is an insanely basic 26 letter substitution cipher.
p.s. this is more a thought experiment than a probable real project for me.
There are 26! different substitution ciphers. That works out to a bit over 88 bits of choice:
>>> from math import log, factorial
>>> log(factorial(26), 2)
88.38195332701626
The entropy of English text is something like 2 bits per character at least. So it seems to me you can't reasonably expect to find passages of more than 45-50 characters that are accidentally equivalent under substitution.
For the large corpus, there's Project Gutenberg and Wikipedia, for a start. You can download a dump of the English Wikipedia's articles as XML from their website.
I think you're asking a bit much to generate a substitution that is also "coherent". Figuring out which text is coherent is an AI-level problem. Also, the longer your text is, the more complicated it will be to create a "coherent" result... quickly approaching a point where you need a "key" as long as the text you are encrypting, which defeats the purpose of encrypting it at all.

Algorithm to estimate number of English translation words from Japanese source

I'm trying to come up with a way to estimate the number of English words a translation from Japanese will turn into. Japanese has three main scripts -- Kanji, Hiragana, and Katakana -- and each has a different average character-to-word ratio (Kanji being the lowest, Katakana the highest).
Examples:
computer: コンピュータ (Katakana: 6 characters); 計算機 (Kanji: 3 characters)
whale: くじら (Hiragana: 3 characters); 鯨 (Kanji: 1 character)
As data, I have a large glossary of Japanese words and their English translations, and a fairly large corpus of matched Japanese source documents and their English translations. I want to come up with a formula that will count numbers of Kanji, Hiragana, and Katakana characters in a source text, and estimate the number of English words this is likely to turn into.
Here's what Borland (now Embarcadero) thinks about English to non-English:
Length of English string (in characters)
Expected increase
1-5 100%
6-12 80%
13-20 60%
21-30 40%
31-50 20%
over 50 10%
I think you can sort of apply this (with some modification) for Japanese to non-Japanese.
Another element you might want to consider is the tone of the language. In English, instructions are phrased as an imperative as in "Press OK." But in Japanese language, imperatives are considered rude, and you must phrase instructions in honorific (or keigo) as in "OKボタンを押してください。"
Watch out for three- and four-character kanji compounds. Many big words translate into them, such as 国際化 (internationalization: 20 characters) or 高可用性 (high availability: 17 characters).
I would start with linear approximation: approx_english_words = a1*no_characters_in_script1 + a2 * no_chars_in_script2 + a3 * no_chars_in_script3, with the coefficients a1, a2, a3 fit from your data using linear least squares.
If this doesn't approximate very well, then look at the worst cases for the reasons they don't fit (specialized words, etc.).
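The linear fit above might be sketched with NumPy's least-squares solver; the counts and word totals below are made-up stand-ins for values measured from the aligned corpus:

```python
import numpy as np

# Each row: (kanji_count, hiragana_count, katakana_count) for one source
# text; y holds the observed English word count of its translation.
# These numbers are invented for illustration only.
X = np.array([
    [10,  40,  5],
    [25,  60, 10],
    [ 5,  20, 15],
    [30, 100, 20],
    [12,  50,  8],
], dtype=float)
y = np.array([30, 55, 18, 90, 38], dtype=float)

# Solve for a1, a2, a3 in approx_words = a1*kanji + a2*hiragana + a3*katakana.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

def estimate_english_words(kanji, hiragana, katakana):
    return float(coeffs @ np.array([kanji, hiragana, katakana], dtype=float))
```

Inspecting the residuals of the fit points you at the texts that break the model, per the note above.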
In my experience as a translator and localization specialist, a good rule of thumb is 2 Japanese characters per English word.
As an experienced translator between Japanese and English, I can say that this is extremely difficult to quantify, but typically in my experience English text translated from Japanese is nearly 200% as many characters as the source text. In Japanese there are many culturally specific phrases and nouns that can't be translated literally and need to be explained in English.
When translating, it is not unusual for me to take a single Japanese sentence and make a single English paragraph out of it in order for the meaning to be communicated to the reader. Off the top of my head, here is an example:
「懐かしい」
This literally means nostalgic. However, in Japanese it can be used as a single phrase in an exclamation. Yet, in English in order to convey a feeling of nostalgia we require a lot more context. For instance, you may need to turn that single phrase into a sentence:
"As I walked by my old elementary school, I was flooded with memories of the past."
This is why machine translation between Japanese and English is impossible.
Well, it's a little more complex than just the number of characters in a noun compared to English. For instance, Japanese also has a different grammatical structure compared to English, so certain sentences would use MORE words in Japanese, and others would use FEWER. I don't really know Japanese, so please forgive me for using Korean as an example.
In Korean, a sentence is often shorter than an English sentence, due mainly to the fact that sentences are cut short by using context to fill in the missing words. For instance, saying "I love you" could be as short as 사랑해 ("sarang hae", simply the verb "love"), or as long as the fully qualified sentence 저는 당신을 사랑해요 (I [topic] you [object] love [verb + polite modifier]). Which form is written depends on context, which is usually set by earlier sentences in the paragraph.
Anyway, having an algorithm that actually KNOWS this kind of thing would be very difficult, so you're probably much better off just using statistics. What you should do is use random samples where the known Japanese texts and English texts have the same meaning. The larger the sample (and the more random it is) the better... though if they are truly random, it won't make much difference how many you have past a few hundred.
Now, another thing: this ratio would change completely depending on the type of text being translated. For instance, a highly technical document is quite likely to have a much higher Japanese/English length ratio than a soppy novel.
As for simply using your dictionary of word-to-word translations - that probably won't work too well (and is probably wrong). The same word does not translate to the same word every time in a different language (although this is much more likely to happen in technical discussions). For instance, take the word beautiful: not only is there more than one word I could assign it to in Korean (i.e. there is a choice), but sometimes I lose that choice, as in the sentence "that food is beautiful", where I don't mean the food looks good; I mean it tastes good, and my choice of translations for that word changes. And this is a VERY common circumstance.
Another big problem is optimal translation - something that humans are really bad at, and that computers are much, much worse at. Whenever I've proofread a document translated from another language into English, I can always see various ways to cut it much shorter.
So although, with statistics, you would be able to work out a pretty good average ratio in length between translations, this will be far different than it would be were all translations to be optimal.
It seems simple enough - you just need to find out the ratios.
For each script, count the number of script characters and English words in your glossary and work out the ratio.
This can be augmented with the Japanese source documents, assuming you can both detect which script a Japanese word is in and what the English equivalent phrase is in the translation. Otherwise you'll have to guesstimate the ratios or ignore this as source data.
Then, as you say, count the number of words in each script of your source text, do the multiplies, and you should have a rough estimate.
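The per-script counting can be sketched using Unicode block ranges (the ranges below cover the common Hiragana, Katakana, and CJK Unified Ideographs blocks; the words-per-character ratios are placeholders to be fit from your glossary and corpus):

```python
def classify_char(ch):
    """Rough script classification by Unicode block."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"

def script_counts(text):
    counts = {"hiragana": 0, "katakana": 0, "kanji": 0, "other": 0}
    for ch in text:
        counts[classify_char(ch)] += 1
    return counts

# English-words-per-character ratio for each script: placeholder values,
# to be replaced by ratios measured from glossary/corpus data.
RATIOS = {"kanji": 0.5, "hiragana": 0.25, "katakana": 0.17}

def estimate_words(text):
    c = script_counts(text)
    return sum(c[s] * r for s, r in RATIOS.items())
```

Note the Katakana block includes the prolonged sound mark ー (U+30FC), so コンピュータ counts as 6 Katakana characters, matching the example above.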
My (albeit tiny) experience seems to indicate that, no matter what the language, blocks of text take the same amount of printed space to convey equivalent information. So, for a large-ish block of text, you could assign a width count to each character in English (grab this from a common font like Times New Roman), and likewise use a common Japanese font at the same point size to calculate the number of characters that would be required.