Detecting "noise" in text extracted from documents - ruby

I am working on retrieving the readable content (i.e. text) from PDF documents, most of which are scientific journal articles.
I am using the Poppler text utilities to convert the PDF to text format.
The text is extracted nicely, but unfortunately so are other components of the articles (e.g. numerical tables), which cannot be rendered properly in plain text.
For example, I might get the following output in the middle of the article:
Character distributions random Hmax
1 2 3 4
Organization c) (of characters over species
A
B
A 0 0 0 + C
B + + + +
C + + + + A
B 4+
H Character distributions nonrandom Hobs
Entropy
3+ 2+ 1+
(diversity of characters over species
My question is: how would I identify such "noise" and differentiate it from normal blocks of text? Are there any existing algorithms? I am working in Ruby, but code in any language will help.

You could use a Naive Bayes Classifier to model valid vs. non-valid lines.
Here's an article on one in Ruby; there's a good implementation in Python's nltk.
To set it up you would need to give it examples, for example by filling one file with good lines and one with bad ones. This is the same model used by spam filters.
One trick for this use case is that many basic Naive Bayes Classifiers work using a word-occurrence model for features, whereas here it's not the vocabulary that's significant. You may wish to use line length, percent spaces (rounded to 5% or 10% intervals), or percent of various punctuation marks (rounded, but with higher precision). Hopefully your classifier will learn that "lines with no periods and 30% spaces are bad" or "lines with no punctuation where every word begins with a capital letter are bad".
Based on just your examples above, though, you could probably reject any line with too high a ratio of spaces or those completely lacking in sentence punctuation such as commas and periods.
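For illustration, here is a rough Python sketch of that feature idea using nltk's NaiveBayesClassifier (mentioned above). The training lines are made-up placeholders; in practice you would load them from your files of good and bad examples.

import nltk

def line_features(line):
    length = len(line)
    space_ratio = line.count(' ') / length if length else 0.0
    return {
        'length_bucket': min(length // 20, 5),        # bucketed line length
        'space_pct': round(space_ratio * 10) * 10,    # percent spaces, rounded to 10%
        'has_sentence_punct': any(c in '.,;:!?' for c in line),
    }

good = ["The results indicate a strong correlation between the two variables.",
        "We collected samples from twelve sites over a period of three years."]
bad = ["A 0 0 0 + C", "B + + + +", "1 2 3 4", "3+ 2+ 1+"]

train = [(line_features(l), 'good') for l in good] + \
        [(line_features(l), 'bad') for l in bad]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(line_features("C + + + + A")))   # expected: 'bad'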

Related

Better algorithm for shortening English words

I have some unique codes that are generated from strings (ex: website host names) in various independent components of my application.
These codes are meant to be used by machines only, so I would like to keep them as short as possible.
The below algorithm would be applied to every word in the string. The output words would be concatenated with a dash to generate the unique code.
The current algorithm I have used:
- Skip word if length is less than 6
- Leave first character as is
- Remove every vowel in the word from the second character onwards
architectural digest eu => archtctrl-dgst-eu
arizona foothills magazine => arzn-fthlls-mgzn
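For reference, here is a minimal Python sketch of the rules above (hypothetical helper names, not the asker's actual code):

VOWELS = set("aeiou")

def shorten_word(word):
    if len(word) < 6:                 # skip short words
        return word
    # keep the first character, drop vowels from the second character onwards
    return word[0] + ''.join(c for c in word[1:] if c not in VOWELS)

def make_code(name):
    return '-'.join(shorten_word(w) for w in name.lower().split())

print(make_code("architectural digest eu"))        # archtctrl-dgst-eu
print(make_code("arizona foothills magazine"))     # arzn-fthlls-mgzn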
Is there a better way to shorten an English word leaving it as recognisable as possible to a human reader?
The output should be deterministic and produce the same shortened version whenever it is run on the same input.
A good algorithm should also minimise the number of clashes for similarly spelt words.
I have some unique codes that are generated from strings
I am afraid that is not true. There are many English words that will reduce to the same 'code word' when stripped of their vowels: for example, both 'leaving' and 'living' become 'lvng'. Granted, this is fairly rare, but it could still cause issues.
How important is it that these 'code words' remain human-readable if, as you say, they are meant to be used by machines only? If it's not that important, I'd suggest looking into some simple compression algorithms like Huffman coding or LZW. Then, if the user needs to see the translation of the code word, just decompress it.
If you must keep it human-readable, I'm not sure that there is much more you can do to shorten it. You could take a look at specific latin + greek roots, and determine if you can shorten those any more by hand, and then just substitute those out automatically.
Alternatively, you could turn to a phonetic approach. Automatically look up the pronunciation of the word, and then see if that is any shorter (or can itself be compressed, taking 'cee' to 'C' or 'kay' to 'K'). This would be much more time- and CPU-intensive, but it's still an option if you really, really need short yet readable codes.
What you're generating sounds like what's called a "slug". There are many libraries to handle this for blogs or site generators that should suit your purposes. Here's a usage example from a Python library called slugify:
from slugify import slugify   # pip install python-slugify

txt = "___This is a test ---"
r = slugify(txt)
assert r == "this-is-a-test"
Slug libraries generally work like this:
replace non-ASCII characters via a transliteration mapping (ex: 影師嗎 -> ying-shi-ma)
replace accented Latin letters with ASCII equivalents via a mapping (ex: C'est déjà l'été. -> c-est-deja-l-ete)
remove leading and trailing spaces/punctuation
convert remaining spaces and punctuation to dashes, collapsing multiple dashes in a row to a single dash
If you want to make slugs shorter you could remove vowels or, more simply, use a maximum length.
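For example, the python-slugify library used above accepts a maximum length (shown here together with its word_boundary option; check your installed version for the exact parameters):

from slugify import slugify   # pip install python-slugify

# Keep the slug to at most 15 characters, breaking at a word boundary.
print(slugify("arizona foothills magazine", max_length=15, word_boundary=True))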

What algorithms can group characters into words?

I have some text generated by some lousy OCR software.
The output contains a mixture of words and space-separated characters that should have been grouped into words. For example,
Expr e s s i o n Syntax
S u m m a r y o f T e r minology
should have been
Expression Syntax
Summary of Terminology
What algorithms can group characters into words?
If I program in Python, C#, Java, C or C++, what libraries provide the implementation of the algorithms?
Thanks.
Minimal approach:
In your input, remove the space before any single letter words. Mark the final words created as part of this somehow (prefix them with a symbol not in the input, for example).
Get a dictionary of English words, sorted longest to shortest.
For each marked word in your input, find the longest match and break that off as a word. Repeat on the characters left over in the original "word" until there's nothing left over. (In the case where there's no match just leave it alone.)
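As a rough illustration (not a full implementation), the longest-match step could look like this in Python, where WORDS stands in for a real dictionary loaded from a word list:

WORDS = {"expression", "syntax", "summary", "of", "terminology"}   # tiny stand-in dictionary

def greedy_split(blob, words=WORDS):
    """Repeatedly peel off the longest dictionary prefix of blob."""
    result, rest = [], blob.lower()
    while rest:
        for end in range(len(rest), 0, -1):    # longest prefix first
            if rest[:end] in words:
                result.append(rest[:end])
                rest = rest[end:]
                break
        else:                                  # nothing matches: leave the rest alone
            result.append(rest)
            return result
    return result

print(greedy_split("summaryofterminology"))    # ['summary', 'of', 'terminology']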
More sophisticated, overkill approach:
The problem of splitting words without spaces is a real-world problem in languages commonly written without spaces, such as Chinese and Japanese. I'm familiar with Japanese so I'll mainly speak with reference to that.
Typical approaches use a dictionary and a sequence model. The model is trained to learn transition properties between labels - part of speech tagging, combined with the dictionary, is used to figure out the relative likelihood of different potential places to split words. Then the most likely sequence of splits for a whole sentence is solved for using (for example) the Viterbi algorithm.
Creating a system like this is almost certainly overkill if you're just cleaning OCR data, but if you're interested it may be worth looking into.
A sample case where the more sophisticated approach will work and the simple one won't:
input: Playforthefunofit
simple output: Play forth efunofit (forth is longer than for)
sophisticated output: Play for the fun of it (forth efunofit is a low-frequency - that is, unnatural - transition, while for the is not)
You can work around the issue with the simple approach to some extent by adding common short-word sequences to your dictionary as units. For example, add forthe as a dictionary word, and split it in a post processing step.
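As a middle ground between the two approaches, here is a rough Python sketch of a purely frequency-based splitter: dynamic programming over a unigram cost model. The frequency table below is a made-up toy; in practice you would count words in a large corpus, and this is still far simpler than the dictionary-plus-tagger pipeline described above.

import math

FREQ = {"play": 500, "for": 9000, "forth": 40, "the": 20000, "fun": 300,
        "of": 15000, "it": 8000}
TOTAL = sum(FREQ.values())

def cost(word):
    """Negative log probability; unknown chunks get a length-based penalty."""
    prob = FREQ.get(word, 0.1 ** len(word))
    return -math.log(prob / TOTAL)

def segment(text):
    """Return the lowest-cost segmentation of text under the unigram model."""
    text = text.lower()
    best = [(0.0, 0)]                          # best[i] = (cost, split point) for text[:i]
    for i in range(1, len(text) + 1):
        candidates = [(best[j][0] + cost(text[j:i]), j)
                      for j in range(max(0, i - 20), i)]
        best.append(min(candidates))
    words, i = [], len(text)
    while i > 0:                               # walk the split points backwards
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

print(segment("Playforthefunofit"))            # ['play', 'for', 'the', 'fun', 'of', 'it']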
Hope that helps - good luck!

Counting words from a mixed-language document

Given a set of lines containing Chinese characters, Latin-alphabet-based words or a mixture of both, I wanted to obtain the word count.
To wit:
this is just an example
这只是个例子
should give 10 words ideally; but of course, without access to a dictionary, 例子 would best be treated as two separate characters. Therefore, a count of 11 words/characters would also be an acceptable result here.
Obviously, wc -w is not going to work. It considers the 6 Chinese characters / 5 words as 1 "word", and returns a total of 6.
How do I proceed? I am open to trying different languages, though bash and python will be the quickest for me right now.
You should split the text on Unicode word boundaries, then count the elements which contain letters or ideographs. If you're working with Python, you could use the uniseg or nltk packages, for example. Another approach is to simply use Unicode-aware regexes but these will only break on simple word boundaries. Also see the question Split unicode string on word boundaries.
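If you go the regex route, a rough Python sketch using the third-party regex module (which, unlike the stdlib re module, supports Unicode script properties) could count runs of Latin letters as words and each Han character as one unit:

import regex

def count_units(line):
    latin_words = regex.findall(r"\p{Latin}+", line)
    han_chars = regex.findall(r"\p{Han}", line)
    return len(latin_words) + len(han_chars)

print(count_units("this is just an example"))   # 5
print(count_units("这只是个例子"))                 # 6
print(count_units("这是test"))                   # 3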
Note that you'll need a more complex dictionary-based solution for some languages. UAX #29 states:
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.
I thought about a quick hack since Chinese characters are 3 bytes long in UTF8:
(in Python, operating on the raw UTF-8 bytes of each line)
def count_line(line_bytes):
    chinese_bytes = 0
    words = 0
    for b in line_bytes:
        if b & 0x80:            # high bit set: part of a multi-byte (Chinese) character
            chinese_bytes += 1
        elif b == ord(' '):     # a space ends a "normal" word
            words += 1
        elif b == ord('\n'):
            break
    return chinese_bytes // 3 + words
That is, take total Chinese bytes / 3 + total words to get the sum for each line. This will give an erroneous count for the case of mixed languages, but should be a good start.
这是test
However, the above sentence will give a total of 2 (1 for each of the Chinese characters.) A space between the two languages would be needed to give the correct count.

How to neglect the output of OCR Engine that has no meaning?

The Tesseract OCR engine sometimes outputs text that has no meaning. I want to design an algorithm that rejects any text or word that has no meaning; below is the sort of output I want to discard. My simple solution is to count the words in the recognized text (separated by " ") and treat text with too many words as garbage (hint: the images I'm scanning contain at most 40 words). Any idea will be helpful, thanks.
wo:>"|axnoA1wvw\
ldflfig
°J!9O‘ !P99W M9N 6 13!-|15!Cl ‘I-/Vl
978 89l9 Z0 3+ 3 'l9.l.
97 999 VLL lLOZ+ 3 9l!q°lN
wo0'|axno/(#|au1e>1e: new;
1=96r2a1ey\1 1uauud0|e/\e(]
|8UJB){ p8UJL|\7'
Divide the output text into words. Divide the words into triples. Count the triple frequencies, and compare to triple frequencies from a known-good text corpus (e.g. all the articles from some mailing list discussing what you intend to OCR, minus the header lines).
When I say "triples", I mean:
whe, hen, i, say, tri, rip, ipl, ple, les, i, mea, ean
...so "i" has a frequency of 2 in this short example, while the others are all frequency 1.
If you do a frequency count of each of these triples for a large document in your intended language, it should become possible to be reasonably accurate in guessing whether a string is in the same language.
Granted, it's heuristic.
I've used a similar approach for detecting English passwords in a password changing program. It worked pretty well, though there's no such thing as a perfect "obvious password rejecter".
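For what it's worth, a rough Python sketch of that triple-frequency check might look like this; the corpus file name is a placeholder for whatever known-good text you have, and the cutoff has to be tuned empirically.

from collections import Counter

def triples(text):
    words = ''.join(c if c.isalpha() else ' ' for c in text.lower()).split()
    return Counter(w[i:i + 3] for w in words for i in range(max(1, len(w) - 2)))

reference = triples(open('known_good_corpus.txt').read())   # assumed corpus file

def language_score(line):
    """Fraction of the line's triples that also occur in the reference corpus."""
    counts = triples(line)
    if not counts:
        return 0.0
    known = sum(n for t, n in counts.items() if t in reference)
    return known / sum(counts.values())

# Garbage lines should score well below normal text; pick a cutoff empirically.
print(language_score("Summary of Terminology"))
print(language_score('wo:>"|axnoA1wvw\\'))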
Check the words against a dictionary?
Of course, this will have false positives for things like foreign phrases or code. The problem in general is intractable (ex: is this code or gibberish? :) ). The only (nearly) perfect method would be to use this as a heuristic to flag certain sections for human review.
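A quick sketch of that dictionary check in Python (the word-list path is a placeholder; use whatever dictionary file you have available):

with open("/usr/share/dict/words") as f:            # placeholder word list
    english_words = {w.strip().lower() for w in f}

def mostly_real_words(text, threshold=0.5):
    words = [w.strip(".,;:!?()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return False
    hits = sum(w in english_words for w in words)
    return hits / len(words) >= threshold

print(mostly_real_words("Summary of Terminology"))      # True with a typical word list
print(mostly_real_words("ldflfig 1=96r2a1ey 1uauud0"))  # False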

decoding algorithm wanted

I receive encoded PDF files regularly. The encoding works like this:
the PDFs can be displayed correctly in Acrobat Reader
select all and copy the text via Acrobat Reader
and paste in a text editor
will show that the content is encoded
so, examples are:
13579 -> 3579;
hello -> jgnnq
it's basically an offset (maybe swap) of ASCII characters.
The question is how can I find the offset automatically when I have access to only a few samples. I cannot be sure whether the encoding offset is changed. All I know is some text will usually (if not always) show up, e.g. "Name:", "Summary:", "Total:", inside the PDF.
Thank you!
edit: thanks for the feedback. I'd try to break the question into smaller questions:
Part 1: How to detect identical part(s) inside string?
You need to brute-force it.
If the pattern is a simple character-code shift like in your examples (which are both +2 char codes):
h i j
e f g
l m n
l m n
o p q
1 2 3
3 4 5
5 6 7
7 8 9
9 : ;
You could easily implement something like this to check against known words:
>>> text = 'jgnnq'
>>> knowns = ['hello', '13579']
>>>
>>> for i in range(-5, 6):                  # check the -5 to +5 char-code range
...     rot = ''.join(chr(ord(c) + i) for c in text)
...     for known in knowns:
...         if known in rot:
...             print(rot)
...
hello
Is the PDF going to contain symbolic (like math or proofs) or natural language text (English, French, etc)?
If the latter, you can use a frequency chart for letters (digraphs, trigraphs and a small dictionary of words if you want to go the distance). I think there are probably a few of these online. Here's a start. And more specifically letter frequencies.
Then, if you're sure it's a Caesar shift, you can grab the first 1000 characters or so and shift them forward by increasing amounts up to (I would guess) 127 or so. Take the resulting texts and calculate how close the frequencies match the average ones you found above. Here is information on that.
The linked letter frequencies page on Wikipedia shows only letters, so you may want to exclude non-letter characters from your calculation, or better, find a chart that includes them. You may also want to transform the entire resulting text into lowercase or uppercase (your preference) to treat letters the same regardless of case.
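To make that concrete, here is a rough Python sketch of the shift-and-score idea; the frequency values are rounded approximations, the space weight is a crude addition, and the sample plaintext and its +2 shift are made up for the demonstration (a sketch, not a full cryptanalysis tool).

FREQ = {
    ' ': 18.0,   # space is roughly the most common character in running text
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7, 's': 6.3,
    'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8, 'u': 2.8, 'm': 2.4,
    'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0, 'p': 1.9, 'b': 1.5, 'v': 1.0,
    'k': 0.8, 'j': 0.15, 'x': 0.15, 'q': 0.10, 'z': 0.07,
}

def english_score(text):
    """Average per-character frequency; higher means more English-like."""
    return sum(FREQ.get(c, 0.0) for c in text.lower()) / max(1, len(text))

def rank_shifts(ciphertext, max_shift=40):
    """Try every shift in the range and rank the decodings by English-likeness."""
    candidates = []
    for shift in range(-max_shift, max_shift + 1):
        plain = ''.join(chr((ord(c) - shift) % 128) for c in ciphertext)
        candidates.append((english_score(plain), shift, plain))
    return sorted(candidates, reverse=True)

# Simulate the described +2 encoding on a made-up snippet, then rank the shifts.
cipher = ''.join(chr(ord(c) + 2) for c in "Name: Smith Summary: see total below")
for score, shift, plain in rank_shifts(cipher)[:3]:
    # the true shift (+2) should rank at or near the top for text this long
    print(round(score, 2), shift, plain)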
Edit - saw comment about character swapping
In this case, it's a substitution cipher, which can still be broken automatically, though this time you will probably want to have a digraph chart handy to do extra analysis. This is useful because there will quite possibly be a substitution that is "closer" to average language in terms of letter analysis than the correct one, but comparing digraph frequencies will let you rule it out.
Also, I suggested shifting the characters, then seeing how close the frequencies matched the average language frequencies. You can actually just calculate the frequencies in your ciphertext first, then try to line them up with the good values. I'm not sure which is better.
Hmmm, that's a tough one.
The only thing I can suggest is that using a dictionary (along with some substitution cipher algorithms) may help in decoding some of the text.
But I cannot see a solution that will decode everything for you in the scenario you describe.
Why don't you paste some sample input and we can have a go at decoding it?
It's only possible when you have a lot of examples (enough to cover all the combinations, to spot a simple linear dependency between values, or at least to get an idea of the scheme).
Also, this question: How would I reverse engineer a cryptographic algorithm? has some advice.
Do the encoded files open correctly in PDF readers other than Acrobat Reader? If so, you could just use a PDF library (e.g. PDF Clown) and use it to programmatically extract the text you need.
