I'm working on digitizing a large collection of scanned documents, using Tesseract 3 as my OCR engine. The quality of its output is mediocre: it often produces garbage characters before and after the actual text, as well as misspellings within the text.
For the former problem, it seems like there must be strategies for determining which text is actually text and which text isn't (much of this text is things like people's names, so I'm looking for solutions other than looking up words in a dictionary).
For the typo problem, most of the errors stem from a few misclassifications of letters (substituting l, 1, and I for one another, for instance), and it seems like there should be methods for guessing which words are misspelled (since not too many words in English have a "1" in the middle of them), and guessing what the appropriate correction is.
What are the best practices in this space? Are there free/open-source implementations of algorithms that do this sort of thing? Google has yielded lots of papers, but not much concrete. If there aren't implementations available, which of the many papers would be a good starting place?
For "determining which text is actually text and which text isn't" you might want to look at rmgarbage from same department that developed Tesseract (the ISRI). I've written a Perl implementation and there's also a Ruby implementation. For the 1 vs. l problem I'm experimenting with ocrspell (again from the same department), for which their original source is available.
I can only post two links, so the missing ones are:
ocrspell: enter "10.1007/PL00013558" at dx.doi.org
rmgarbage: search for "Automatic Removal of Garbage Strings in OCR Text: An Implementation"
ruby implementation: search for "docsplit textcleaner"
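For a flavour of the kind of checks rmgarbage-style filters make, here is an illustrative Python sketch; these particular rules and thresholds are my own simplification for the example, not the published rmgarbage rule set.

```python
import re

def looks_like_garbage(token):
    """Illustrative rmgarbage-style checks (simplified, not the published rules)."""
    if len(token) > 40:                        # absurdly long "word"
        return True
    if re.search(r'(.)\1{3,}', token):         # four or more identical characters in a row
        return True
    alnum = sum(c.isalnum() for c in token)
    if token and alnum / len(token) < 0.5:     # mostly punctuation
        return True
    return False

line = 'Tbe quick brown fox ~~#@!xx jumped aaaaaab'
print([t for t in line.split() if not looks_like_garbage(t)])
# -> ['Tbe', 'quick', 'brown', 'fox', 'jumped']
```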
Something that could be useful for you is to try this free online OCR and compare its results with yours to see if by playing with the image (e.g. scaling up/down) you could improve the results.
I was using it as an "upper bound" of the results I should get when using tesseract myself (after using OpenCV to modify the images).
I want to generate keywords for my CMS.
Does anyone know a good PHP script (or something else) that generates keywords?
I have an HTML site like this: http://pastebin.com/ZU8vdyeP
This is a very hard problem for a computer to solve. It would be much easier to get somebody (else?) to do it manually, or simply not do it at all.
If you really need a computer to do it, I'd head over to the excellent Python library NLTK, which has many tools for this sort of thing (i.e. natural language processing), and it's a lot of fun to work with.
For example, you could compute a frequency distribution of the words, then look up the most common hypernyms of the longer words (above, say, 5 characters) that appear most frequently, and use those as hints of what the keywords could be.
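A minimal sketch of that idea with NLTK and WordNet; the tokenisation, the length cutoff and the "first synset, first hypernym" choice are just illustrative assumptions:

```python
import nltk
from nltk.corpus import wordnet as wn

# One-time data downloads:
# nltk.download('punkt'); nltk.download('wordnet')

def suggest_keywords(text, top_n=10, min_len=6):
    """Return (word, count, hypernym hint) for the most frequent longer words."""
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    freq = nltk.FreqDist(t for t in tokens if len(t) >= min_len)
    suggestions = []
    for word, count in freq.most_common(top_n):
        synsets = wn.synsets(word)
        hypernyms = synsets[0].hypernyms() if synsets else []
        hint = hypernyms[0].lemma_names()[0] if hypernyms else None
        suggestions.append((word, count, hint))
    return suggestions

print(suggest_keywords("Tesseract is an optical character recognition engine "
                       "for scanned documents and images."))
```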
Again, it is much easier to get it done by a human, however.
To automate it: get the words from the article, match them against a blacklist, and don't include words under 4 characters.
Additionally, let the user edit the keywords manually, and only generate them automatically when no keywords are present yet.
This can be done with a database trigger or in the application layer.
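A literal sketch of that blacklist-plus-length filter (the blacklist and the article text are placeholders):

```python
import re

BLACKLIST = {'this', 'that', 'with', 'from', 'have'}  # placeholder stop list

def auto_keywords(article_text):
    """Keep distinct words of 4+ characters that are not blacklisted."""
    words = set(re.findall(r'[a-z]+', article_text.lower()))
    return sorted(w for w in words if len(w) >= 4 and w not in BLACKLIST)

print(auto_keywords("Scanned documents from the archive, with OCR output to clean."))
# -> ['archive', 'clean', 'documents', 'output', 'scanned']
```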
If I understand the problem, you have text and you want to determine keywords that are most relevant to the text.
Three approaches:
1) Have user enter keywords
2) Statistical analysis of the text, for example determining which words are far more common in the text than in the language overall (a bare-bones sketch of this follows below). Any good text on Information Retrieval will cover some suitable algorithms.
3) If you have a set of documents that are already classified (perhaps previously classified by humans) then you can use a machine learning algorithm (perhaps a Bayesian classifier) to train the system to classify the new documents. If you let the users override/correct the suggested keywords, the system can learn over time.
Personally, I'd do #3, since it is more adaptive.
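A bare-bones sketch of approach 2 (not a full IR implementation): score each word by how much more frequent it is in the document than in a background corpus. The background counts here are a toy stand-in; in practice you would count words over your whole collection or a reference corpus.

```python
import re
from collections import Counter

BACKGROUND = Counter({'the': 1000, 'and': 800, 'of': 700, 'tesseract': 1})  # toy counts
BG_TOTAL = sum(BACKGROUND.values())

def distinctive_words(text, top_n=5):
    """Rank words by how much more frequent they are here than in the background."""
    counts = Counter(re.findall(r'[a-z]+', text.lower()))
    total = sum(counts.values())
    def lift(word):
        doc_rate = counts[word] / total
        bg_rate = (BACKGROUND.get(word, 0) + 1) / (BG_TOTAL + 1)  # add-one smoothing
        return doc_rate / bg_rate
    return sorted(counts, key=lift, reverse=True)[:top_n]

print(distinctive_words("Tesseract reads the scanned pages and the output "
                        "of tesseract needs cleaning."))
```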
The problem
I am trying to improve the result of an OCR process by combining the output of three different OCR systems (Tesseract, Cuneiform, Ocrad).
I already do image preprocessing (deskewing, despeckling, thresholding and more). I don't think this part can be improved much further.
Usually the text to recognize is between one and six words long. The language of the text is unknown, and it quite often contains fantasy words.
I am on Linux. Preferred language would be Python.
What I have so far
Often every result has one or two errors, but at different characters/positions. The errors can be a wrongly recognized character or an inserted character that does not exist in the text. Less often, a character is dropped entirely.
An example might look like this:
Xorem_ipsum
lorXYm_ipsum
lorem_ipuX
An X is a wrongly recognized character and a Y is a character that does not exist in the text. Spaces are replaced by "_" for better readability.
In cases like this I try to combine the different results.
By repeatedly applying the "longest common substring" algorithm to the three pairs, I am able to get the following structure for the given example:
or m_ipsum
lor m_ip u
orem_ip u
But here I am stuck. I am not able to combine those pieces into a result.
The questions
Do you have an idea how to combine the different longest common substrings?
Or do you have a better idea how to solve this problem?
The quality of the results you can expect depends on the OCR engines you are using. You may find that choosing a higher-quality OCR engine that gives you confidence levels and bounding boxes would give you much better raw results in the first place, plus extra information that could be used to determine the correct result.
Using Linux will restrict the possible OCR engines available to you. Personally I would rate Tesseract as 6.5/10 compared to commercial OCR engines available under Windows.
http://www.abbyy.com/ocr_sdk_linux/overview/ - The SDK may not be cheap though.
http://irislinktest.iriscorporate.com/c2-1637-189/iDRS-14-------Recognition--Image-preprocessing--Document-formatting-and-more.aspx - Available for Linux
http://www.rerecognition.com/ - Is available as a Linux version. This engine is used by many other companies.
All of the engines above should give you confidence levels, bounding boxes and better results than Tesseract OCR.
https://launchpad.net/cuneiform-linux - Cuneiform, now open sourced and running under Linux. This is likely one of the three engines you are using. If not, you should probably look at adding it.
Also you may want to look at http://tev.fbk.eu/OCR/Products.html for more options.
Can you post a sample or two of typical images and the OCR results from the engines? There are other ways to improve OCR recognition, but they depend on the images.
Maybe repeat the "longest common substring" until all results are the same.
For your example, you would get the following in the next step:
or m_ip u
or m_ip u
or m_ip u
OR do the "longest common substring" algorithm with the first and second string and then again the result with the third string. So you get the same result or m_ip u more easy.
So you can assume that letters should be correct. Now look at the spaces. Before or there are two times l and once X, so choose l. Between or and m_ip there are two times e and once XY, so choose e. And so on.
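Here is a rough Python sketch of this combine-and-vote idea. difflib.SequenceMatcher is used as a stand-in for a true longest-common-substring routine, and the greedy gap alignment is an assumption that works on the example above but could misalign on nastier inputs.

```python
from collections import Counter
from difflib import SequenceMatcher

def common_skeleton(a, b):
    """Keep only the characters that SequenceMatcher matches in both strings."""
    sm = SequenceMatcher(None, a, b, autojunk=False)
    return ''.join(a[i:i + n] for i, j, n in sm.get_matching_blocks())

def gap_fills(skeleton, candidate):
    """For each gap around the skeleton characters, record what this candidate
    put there (the skeleton is a subsequence of every candidate)."""
    fills, pos = [], 0
    for ch in skeleton:
        nxt = candidate.index(ch, pos)
        fills.append(candidate[pos:nxt])
        pos = nxt + 1
    fills.append(candidate[pos:])
    return fills

def combine(candidates):
    skeleton = candidates[0]
    for c in candidates[1:]:
        skeleton = common_skeleton(skeleton, c)
    fills = [gap_fills(skeleton, c) for c in candidates]
    out = []
    for i in range(len(skeleton) + 1):
        # majority vote over what the engines put in this gap
        out.append(Counter(f[i] for f in fills).most_common(1)[0][0])
        if i < len(skeleton):
            out.append(skeleton[i])
    return ''.join(out)

print(combine(["Xorem_ipsum", "lorXYm_ipsum", "lorem_ipuX"]))  # -> lorem_ipsum
```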
I'm new to OCR, but so far I've found that these systems are built to work from a dictionary of words rather than letter by letter. So if your images don't contain real words, you may have to look more closely at the letter-recognition and training parts of the systems you are using.
I faced a very similar problem.
I hope that this can help: http://dl.tufts.edu/catalog/tufts:PB.001.011.00001
See also software developed by Bruce Robertson: https://github.com/brobertson/rigaudon
I am very new to OCR and know almost nothing about the algorithms used to recognize words; I am just getting familiar with the field.
Could anybody please advise on the typical method used to recognize and separate individual characters in connected form (I mean in a word where all letters are linked together)? Forget about handwriting, supposing the letters are connected together using a known font, what is the best method to determine each individual character in a word? When characters are written separately there is no problem, but when they are joined together, we should know where every single character starts and ends in order to go to the next step and match them individually to a letter.
Is there any known algorithm for that?
The standard term for this process is "character segmentation": segmentation is the image-processing term for breaking images into grouped areas for recognition. Searching for "Arabic character segmentation" turns up a lot of hits on Google Scholar if you want to learn more.
I'd encourage you to look at Tesseract, an open-source OCR implementation, especially its documentation.
Feature as defined in the glossary has a bit on this, but there is a ton of information here.
Basically Tesseract solves the problem (from How Tesseract Works) by looking at blobs (not letters) then combining those blobs into words. This avoids the problem you describe, while creating new problems.
For Arabic (as you point out), Tesseract doesn't work. I don't know much about this area, but this paper seems to imply that Dynamic Time Warping (DTW) is a useful technique. It tries to stretch the words to match them to known words, and again works in word rather than letter space.
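For reference, here is a generic DTW distance sketch (this is the textbook algorithm, not what Tesseract or that paper implements); the one-dimensional "feature vectors" are placeholder assumptions, where a real system would use per-column stroke or profile features extracted from the word image.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance between two
    sequences of feature vectors."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Compare an unknown word's feature sequence against known-word templates and
# pick the template with the smallest distance.
template = [[0.1], [0.5], [0.9], [0.4]]
unknown = [[0.1], [0.2], [0.6], [0.9], [0.4]]
print(dtw_distance(unknown, template))
```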
I have implemented a full-text search in a discussion forum database and I want to display the search results the way Google does. Even for a very long HTML page, only two or three lines of text are displayed in the search result list. Usually these are the lines that contain the search terms.
What would be a good algorithm for extracting a few lines of text, based on the text itself and the search terms? I could think of something as simple as using one line of text before the search-term occurrence and one line after, but that seems too simple to work.
Would like to get a few directions, ideas and insights.
Thank you.
If you are looking for something fancier than the 'line before/after' approach, a summarizer might do the trick.
Here's a Naive Bayes based system: http://classifier4j.sourceforge.net/
Bayes is the statistical system used by many spam filters - I researched Bayes summarizers a few years back, and found that they do a pretty good job of summarizing text, as long as there is a decent amount of text to process. I haven't actually tried the above library, though, so your mileage may vary.
Have you tried the "line before/after search term occurrence" in code to see if, for that simple coding investment, the results are good enough for what you want? It might already be enough.
Otherwise, you could go for pieces of sentences: don't split on fixed lines, but on newlines, full stops, commas, spaced-out hyphens, etc. Then show the pieces that contain the search terms. You could separate each matching sentence piece with "..." or something.
If you get a lot of these pieces, you could try to prioritize the pieces, sort on descending priority and only show the first n of them. And/or cut down the pieces to just the search term and a couple of words around the search term.
Just a couple of informal ideas that might get you started?
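A rough sketch of the sentence-piece idea above (the split pattern and the crude match-count scoring are assumptions, not a tuned implementation):

```python
import re

def make_snippet(text, terms, max_pieces=3):
    """Split into sentence pieces and keep the ones containing the search terms."""
    pieces = [p.strip() for p in re.split(r'[\n.;,]+|\s-\s', text)]
    terms = [t.lower() for t in terms]
    scored = [(sum(p.lower().count(t) for t in terms), p) for p in pieces if p]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)   # pieces with the most matches first
    return ' ... '.join(p for _, p in scored[:max_pieces])

print(make_snippet("Tesseract is an OCR engine. It reads scanned pages, "
                   "and its raw output usually needs post-processing.",
                   ["ocr", "output"]))
```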
Concentrate on the beginning of the content. Think of where you would look when you visit a blog: the opening paragraph tells you whether the article is headed in the right direction. So it makes sense for your algorithm to reflect this.
Check for occurrences of the search term in headings (H1,H2 etc) and give more priority to them.
This should get you started.
Here's the problem. I have a bunch of large text files with paragraphs and paragraphs of written matter. Each para contains references to a few people (names), and documents a few topics (places, objects).
How do I data-mine this pile to assemble some categorised library? In general, I need two things.
I don't know what I'm looking for, so I need a program to find the most-used words and multi-word phrases ("Jacob Smith" or "bluewater inn" or "arrow").
Then, knowing the keywords, I need a program to help me search for related paragraphs, then sort and refine the results (manually, by hand).
Your question is a tiny bit open-ended :)
Chances are, you will find modules for whatever analysis you want to do in the UIMA framework:
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.
UIMA is made up of many components:
UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
You may also find Open Calais a useful API for text analysis; depending on how big your heap of documents is, it may be more or less appropriate.
If you want it quick and dirty -- create an inverted index that stores all locations of words (basically a big map of words to all file ids in which they occur, paragraphs in those files, lines in the paragraphs, etc). Also index tuples so that given a fileid and paragraph you can look up all the neighbors. This will do what you describe, but it takes quite a bit of tweaking to get it to pull up meaningful correlations (some keywords to start you off on your search: information retrieval, TF-IDF, Pearson correlation coefficient).
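A quick-and-dirty version of that index might look like the sketch below: a map from word to (file id, paragraph number, line number) postings. The tokenizer and the assumption that paragraphs are separated by blank lines are illustrative choices.

```python
import re
from collections import defaultdict

def build_index(files):
    """files: dict of file_id -> full text; returns word -> [(file_id, para_no, line_no)]."""
    index = defaultdict(list)
    for file_id, text in files.items():
        for para_no, para in enumerate(text.split('\n\n')):
            for line_no, line in enumerate(para.splitlines()):
                for word in re.findall(r'[a-z]+', line.lower()):
                    index[word].append((file_id, para_no, line_no))
    return index

index = build_index({'doc1': "Jacob Smith stayed at the bluewater inn.\n\n"
                             "The arrow was found nearby."})
print(index['bluewater'])   # -> [('doc1', 0, 0)]
```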
Looks like you're trying to create an index?
I think Learning Perl has information on finding the frequency of words in a text file, so that's not a particularly hard problem.
But do you really want to know that "the" or "a" is the most common word?
If you're looking for some kind of topical index, the words you actually care about are probably down the list a bit, intermixed with more words you don't care about.
You could start by getting rid of "stop words" at the front of the list to filter your results a bit, but nothing would beat associating keywords that actually reflect the topic of the paragraphs, and that requires context.
Anyway, I could be off base, but there you go. ;)
The problem with what you ask is that you don't know what you're looking for. If you had some sort of weighted list of terms that you cared about, then you'd be in good shape.
Semantically, the problem is twofold:
Generally, the most-used words are the least relevant. Even if you use a stop-words file, a lot of chaff remains.
Generally, the least-used words are the most relevant. For example, "bluewater inn" is probably infrequent.
Let's suppose that you had something that did what you ask, and produced a clean list of all the keywords that appear in your texts. There would be thousands of such keywords. Finding "bluewater inn" in a list of 1000s of terms is actually harder than finding it in the paragraph (assuming you don't know what you're looking for) because you can skim the texts and you'll find the paragraph that contains "bluewater inn" because of its context, but you can't find it in a list because the list has no context.
Why don't you tell us more about your application and process? Then perhaps we can help you better.
I think what you want to do is called "entity extraction". This Wikipedia article has a good overview and a list of apps, including open source ones. I used to work on one of the commercial tools in the list, but not in a programming capacity, so I can't help you there.
Ned Batchelder gave a great talk at DevDays Boston about Python.
He presented a spell-corrector written in Python that does pretty much exactly what you want.
You can find the slides and source code here:
http://nedbatchelder.com/text/devdays.html
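A corrector in that general style ranks small edits of the misspelled word by how common the resulting words are in a corpus. Here is a minimal sketch of the idea (not the code from the talk); the tiny word-count table stands in for counts taken from a real corpus.

```python
from collections import Counter

WORD_COUNTS = Counter({'lorem': 5, 'ipsum': 5, 'little': 3, 'tittle': 1})  # stand-in corpus

def edits1(word):
    """All strings one insert, delete, replace or transpose away from word."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    candidates = [w for w in edits1(word) | {word} if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word

print(correct('l0rem'))   # -> 'lorem'
```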
I recommend that you have a look at R. In particular, look at the tm package. Here are some relevant links:
Paper about the package in the Journal of Statistical Software: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example analysing postings to the R-devel mailing list (https://stat.ethz.ch/pipermail/r-devel/) from 2006.
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
More generally, there are a large number of text mining packages on the Natural Language Processing view on CRAN.