Extracting dates from a scan with Tesseract (OCR) is proving difficult. Assistance required - image

I am finding it difficult to extract the dates from the scan below.
It would seem straightforward, but the results are not very good.
I've tried using TextCleaner/Convert to pre-process the image, to no avail.
Can anyone help?

You should probably consider picking a better OCR engine. Tesseract is free and good enough for many purposes, but it is no match for the leading commercial OCR engines. Here's what ABBYY can do with this image without any prior scaling-up or preprocessing (it does all the preprocessing needed automatically). It picked up not only all the text but also the digits:
You can play around with it yourself using the demo tool here (no registration required). For that particular result I selected the "English"/"Text Extraction"/"Auto" parameters.
Disclaimer: I work for ABBYY

Related

Theory, idea for finding copied shapes on an image

The description of my problem is simple; I fear the problem itself isn't. I would like to find the copied, duplicated part of an image. Which part of the image has been copied and pasted back into the same image at another position (for example, using Photoshop)?
Please check the attached image. The red rectangle containing the value 20 has been moved from the price field to the validity field. Please note that the rectangle's size and position are not fixed and are unknown; they can vary. Only the image is given, no other information.
Could you help me by naming a theoretical method, an idea, a paper, or people who are working on the problem above?
I posted my question here (Stack Overflow) instead of on the Computer Vision site to reach as many people as I can, because maybe the problem can be transformed. I could imagine a solution like looking for the two largest rectangles that contain the same values inside a huge matrix (the image).
Thanks for your help and time.
Note: I don't want to use the metadata to detect the forgery.
If you have access to the digital version of the forgery, and the forger (or the author of the forgery-creation software) is a complete idiot, it can be as simple as looking at the image metadata for signs of 'shopping.
If the digital file has been "washed" to remove said signs, or the forgery has been printed and then scanned back to you, it is a MUCH harder problem, again unless the forgers are complete idiots.
In the latter case you can only hope to make the forger's work harder; there is no way to make it impossible. After all, banknotes can be forged, and they are much better protected than train tickets.
I'd start reading from here: http://www.cs.dartmouth.edu/farid/downloads/publications/spm09.pdf
SIFT features can be used to identify "similar regions" that might have been copied from a different part of the image. A starting point is OpenCV's SIFT demo (included in the library): feed it parts of the image as input to see where a rough match is available. Detailed matching can follow to check whether the region actually is a copy.

OCR for scanning printed receipts. [duplicate]

Would OCR Software be able to reliably translate an image such as the following into a list of values?
UPDATE:
In more detail the task is as follows:
We have a client application, where the user can open a report. This report contains a table of values.
But not every report looks the same: different fonts, different spacing, different colors, and maybe the report contains several tables with different numbers of rows/columns...
The user selects the area of the report which contains a table, using the mouse.
Now we want to convert the selected table into values - using our OCR tool.
At the time the user selects the rectangular area, I can ask for extra information to help with the OCR process, and ask for confirmation that the values have been correctly recognised.
It will initially be an experimental project, and therefore most likely with an OpenSource OCR tool - or at least one that does not cost any money for experimental purposes.
The simple answer is YES, you just have to choose the right tools.
I don't know whether open source can ever get close to 100% accuracy on those images, but based on the answers here, probably yes, if you spend some time on training and solve the table analysis problem and the like.
When we talk about commercial OCR like ABBYY or others, it will give you 99%+ accuracy out of the box and it will detect tables automatically. No training, no anything; it just works. The drawback is that you have to pay for it. Some would object that with open source you pay with your time to set it up and maintain it, but everyone decides that for themselves.
However, if we talk about commercial tools, there is actually more choice, and it depends on what you want. Boxed products like FineReader are aimed at converting input documents into editable documents like Word or Excel. Since you actually want the data, not a Word document, you may need to look into a different product category: Data Capture, which is essentially OCR plus additional logic to find the necessary data on the page. In the case of an invoice, that could be the company name, total amount, due date, line items in the table, etc.
Data Capture is a complicated subject and requires some learning, but used properly it can give guaranteed accuracy when capturing data from documents. It uses various rules for data cross-checks, database lookups, etc. When necessary, it may send data for manual verification. Enterprises widely use Data Capture applications to enter millions of documents every month and rely heavily on the extracted data in their everyday workflows.
And there are also OCR SDKs, of course, which give you API access to the recognition results so that you can program what to do with the data.
If you describe your task in more detail, I can advise you on which direction is easier to go.
UPDATE
So what you do is basically a Data Capture application, but not a fully automated one, using the so-called "click to index" approach. There is a number of applications like that on the market: you scan images, an operator clicks on text in the image (or draws a rectangle around it), and the fields are populated into a database. It is a good approach when the number of images to process is relatively small and the manual workload is not big enough to justify the cost of a fully automated application (yes, there are fully automated systems that can handle images with different fonts, spacing, layouts, numbers of rows in tables, and so on).
If you decide to develop rather than buy, then all you need here is to choose an OCR SDK. You are going to write all the UI yourself, right? The big decision is: open source or commercial.
The best open-source option is Tesseract OCR, as far as I know. It is free, but it may have real problems with table analysis; with a manual zoning approach, though, that should not be an issue. As for OCR accuracy: people often train OCR on a particular font to increase accuracy, but that should not apply in your case, since the fonts can differ. So you can just try Tesseract out and see what accuracy you get; this will determine the amount of manual correction work needed.
Commercial OCR will give higher accuracy but will cost you money. I think you should take a look anyway to see whether it is worth it, or whether Tesseract is good enough for you. The simplest way would be to download a trial version of a boxed OCR product like FineReader; that will give you a good idea of the accuracy an OCR SDK would deliver.
If you always have solid borders in your table, you can try this solution:
1. Locate the horizontal and vertical lines on each page (long runs of black pixels).
2. Segment the image into cells using the line coordinates.
3. Clean up each cell (remove borders, threshold to black and white).
4. Perform OCR on each cell.
5. Assemble the results into a 2D array.
If instead your document has a borderless table, you can try this approach:
Optical Character Recognition is pretty amazing stuff, but it isn't always perfect. To get the best possible results, it helps to use the cleanest input you can. In my initial experiments, I found that performing OCR on the entire document actually worked pretty well as long as I removed the cell borders (long horizontal and vertical lines). However, the software compressed all whitespace into a single empty space. Since my input documents had multiple columns with several words in each column, the cell boundaries were getting lost. Retaining the relationship between cells was very important, so one possible solution was to draw a unique character, like "^", on each cell boundary: something the OCR would still recognize and that I could use later to split the resulting strings.
I found all this information at this link, by asking Google for "OCR to table". The author published a full algorithm using Python and Tesseract, both open-source solutions!
If you want to try the Tesseract power, maybe you should try this site:
http://www.free-ocr.com/
Which OCR are you talking about?
Will you be developing code based on that OCR, or will you be using something off the shelf?
FYI:
Tesseract OCR
It has the document-reading executable implemented, so you can feed a whole page in and it will extract the characters for you. It recognizes blank spaces pretty well, and it might be able to help with tab spacing.
I've been OCR'ing scanned documents since '98. This is a recurring problem for scanned docs, especially those that include rotated and/or skewed pages.
Yes, there are several good commercial systems, and some can provide, once well configured, a terrific automatic data-mining rate, asking for the operator's help only on very degraded fields. If I were you, I'd rely on one of them.
If the commercial choices threaten your budget, OSS can lend a hand. But "there's no free lunch": you'll have to rely on a bunch of tailor-made scripts to scaffold an affordable solution for processing your pile of docs. Fortunately, you are not alone. In fact, over the past decades many people have been dealing with this problem. So, IMHO, the best and most concise answer to this question is provided by this article:
https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/
It is worth reading! The author offers useful tools of his own, but the article's conclusion is very important for getting a good mindset about how to solve this kind of problem.
"There is no silver bullet."
(Fred Brooks, The Mythical Man-Month)
It really depends on implementation.
There are a few parameters that affect the OCR's ability to recognize:
1. How well the OCR is trained - the size and quality of the examples database
2. How well it is trained to detect "garbage" (besides knowing what's a letter, you need to know what is NOT a letter).
3. The OCR's design and type
4. If it's a neural network, the network's structure affects its ability to learn and "decide".
So, if you're not making one of your own, it's just a matter of testing different kinds until you find one that fits.
You could try another approach. With Tesseract (or other OCR engines) you can get coordinates for each word. Then you can try to group those words by vertical and horizontal coordinates to get rows/columns, for example to tell the difference between a white space and a tab space. It takes some practice to get good results, but it is possible. With this method you can detect tables even if they use invisible separators, i.e. no lines. The word coordinates are a solid base for table recognition.
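A minimal sketch of this grouping idea, operating on word boxes of the kind Tesseract's TSV output provides (e.g. via pytesseract.image_to_data); the sample boxes below are made up:

```python
def group_into_rows(words, row_tol=10):
    """Group (text, left, top) word boxes into rows by vertical position,
    then sort each row left-to-right - enough to recover a borderless table."""
    rows = []
    for text, left, top in sorted(words, key=lambda w: w[2]):
        if rows and abs(rows[-1][0][2] - top) <= row_tol:
            rows[-1].append((text, left, top))   # same baseline: same row
        else:
            rows.append([(text, left, top)])     # new row
    return [[w[0] for w in sorted(r, key=lambda w: w[1])] for r in rows]

# Hypothetical boxes as Tesseract might report them (text, left, top):
boxes = [("12.50", 300, 52), ("Coffee", 40, 50), ("Bagel", 40, 98), ("3.20", 300, 100)]
print(group_into_rows(boxes))  # [['Coffee', '12.50'], ['Bagel', '3.20']]
```

Column assignment works the same way on the left coordinates; large horizontal gaps within a row mark column boundaries.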
We also have struggled with the issue of recognizing text within tables. There are two solutions which do it out of the box, ABBYY Recognition Server and ABBYY FlexiCapture. Rec Server is a server-based, high volume OCR tool designed for conversion of large volumes of documents to a searchable format. Although it is available with an API for those types of uses we recommend FlexiCapture. FlexiCapture gives low level control over extraction of data from within table formats including automatic detection of table items on a page. It is available in a full API version without a front end, or the off the shelf version that we market. Reach out to me if you want to know more.
Here are the basic steps that have worked for me. Tools needed include Tesseract, Python, OpenCV, and ImageMagick if you need to do any rotation of images to correct skew.
Use Tesseract to detect rotation and ImageMagick mogrify to fix it.
Use OpenCV to find and extract tables.
Use OpenCV to find and extract each cell from the table.
Use OpenCV to crop and clean up each cell so that there is no noise that will confuse OCR software.
Use Tesseract to OCR each cell.
Combine the extracted text of each cell into the format you need.
The code for each of these steps is extensive, but if you want to use a python package, it's as simple as the following.
pip3 install table_ocr
python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png
That package and demo module will turn the following table into CSV output.
Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4
If you need to make any changes to get the code to work for table borders with different widths, there are extensive notes at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html

Tools for Feature Extraction from Binary Data of Images

I am working on a project where I have image files that have been malformed (fuzzed, i.e. their image data has been altered). When rendered on various platforms, these files lead to a warning/crash/pass report from the platform.
I am trying to build a shield using unsupervised machine learning that will help me identify/classify these images as malicious or not. I have the binary data of these files, but I have no clue what feature set/patterns I can identify from it, because visually these images could be anything. (I need to be able to derive a feature set from the binary data.)
I need some advice on the tools/methods I could use for automatic feature extraction from this binary data; feature sets which I can use with unsupervised learning algorithms such as Kohonen's SOM, etc.
I am new to this, any help would be great!
I do not think this is feasible.
The problem is that these are old exploits, and training on them will not tell you much about future exploits, because this is an extremely unbalanced problem: no exploit uses the same technique as another. So even if you generate multiple files of the same type, you will likely end up with a single relevant training case per exploit.
Nevertheless, what you need to do is extract features from the file metadata. That is where the exploits live, not in the actual image data. As such, parsing the files is already very much the area where the problem lies, and your detection tool may itself become vulnerable to exactly such an exploit.
And since the data may be compressed, a naive binary feature approach will not work either.
You probably don't want to look at the actual pixel data at all, since the corruption almost certainly lies in the file header with its different "chunks" (the example below is for PNG; other formats differ in detail but work the same way in principle):
http://en.wikipedia.org/wiki/Portable_Network_Graphics#File_header
It should be straightforward to choose features: write a program that reads all the header information from the file, notes whether any of it is missing, and uses this information as features. That will still be much smaller than the unnecessary raw image data.
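A sketch of that idea for PNG: walk the chunk list and record structural facts (chunk names, CRC validity, presence of mandatory chunks) as features. The feature choice here is illustrative, not exhaustive:

```python
import struct
import zlib

PNG_SIG = b"\x89PNG\r\n\x1a\n"

def png_header_features(data):
    """Walk the chunk list of a PNG byte string and record structural
    facts usable as features for an unsupervised learner."""
    feats = {"valid_sig": data[:8] == PNG_SIG, "chunks": [], "bad_crc": 0}
    pos = 8
    while pos + 8 <= len(data):
        length, name = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        crc = data[pos + 8 + length:pos + 12 + length]
        feats["chunks"].append(name.decode("latin-1"))
        # A mismatched CRC is itself a strong "this file was tampered with" feature.
        if len(crc) == 4 and struct.unpack(">I", crc)[0] != zlib.crc32(name + body):
            feats["bad_crc"] += 1
        pos += 12 + length
    feats["has_ihdr"] = "IHDR" in feats["chunks"]
    feats["has_iend"] = "IEND" in feats["chunks"]
    return feats
```

The resulting dictionary can be flattened into a fixed-length numeric vector (chunk counts, flags) before feeding it to SOM/k-means.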
Oh, and always start out with simpler algorithms like PCA together with k-means or something similar, and only bring out the big guns if those fail.
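For completeness, both of those baselines fit in a few lines of plain NumPy (a toy SVD-based PCA and Lloyd's-algorithm k-means; in practice you would reach for scikit-learn instead):

```python
import numpy as np

def pca(X, k=2):
    """Project feature vectors onto their top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

def kmeans(X, k=2, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```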

OCR error correction algorithms

I'm working on digitizing a large collection of scanned documents, working with Tesseract 3 as my OCR engine. The quality of its output is mediocre, as it often produces both garbage characters before and after the actual text, and misspellings within the text.
For the former problem, it seems like there must be strategies for determining which text is actually text and which text isn't (much of this text is things like people's names, so I'm looking for solutions other than looking up words in a dictionary).
For the typo problem, most of the errors stem from a few misclassifications of letters (substituting l, 1, and I for one another, for instance), and it seems like there should be methods for guessing which words are misspelled (since not too many words in English have a "1" in the middle of them), and guessing what the appropriate correction is.
What are the best practices in this space? Are there free/open-source implementations of algorithms that do this sort of thing? Google has yielded lots of papers, but not much concrete. If there aren't implementations available, which of the many papers would be a good starting place?
For "determining which text is actually text and which text isn't" you might want to look at rmgarbage from the same department that developed Tesseract (the ISRI). I've written a Perl implementation, and there's also a Ruby implementation. For the 1 vs. l problem I'm experimenting with ocrspell (again from the same department), whose original source is available.
I can only post two links, so the missing ones are:
ocrspell: enter "10.1007/PL00013558" at dx.doi.org
rmgarbage: search for "Automatic Removal of Garbage Strings in OCR Text: An Implementation"
ruby implementation: search for "docsplit textcleaner"
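The question's l/1/I substitution idea can be sketched as a candidate generator, which tools like ocrspell formalize with real engine-specific error statistics; the confusion table and lexicon below are made-up illustrations:

```python
from itertools import product

# Illustrative OCR confusion classes; a real system derives these from
# measured engine error statistics (as ocrspell does).
CONFUSION = {"1": "1lI", "l": "1lI", "I": "1lI", "0": "0O", "O": "0O",
             "5": "5S", "S": "5S"}

def candidates(word, lexicon):
    """Try every combination of per-character confusion substitutions
    and keep the variants that the lexicon accepts."""
    options = [CONFUSION.get(ch, ch) for ch in word]
    return sorted({"".join(p) for p in product(*options)} & lexicon)

print(candidates("wa1k", {"walk", "wall", "milk"}))  # ['walk']
```

For names and other out-of-dictionary words this needs a custom lexicon, which is exactly where the dictionary-lookup approach breaks down, as the question notes.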
Something that could be useful for you is to try this free online OCR and compare its results with yours, to see whether by playing with the image (e.g. scaling it up or down) you could improve the results.
I was using it as an "upper bound" on the results I should get when running Tesseract myself (after using OpenCV to modify the images).

OCR error correction: How to combine three erroneous results to reduce errors

The problem
I am trying to improve the result of an OCR process by combining the output from three different OCR systems (Tesseract, Cuneiform, Ocrad).
I already do image preprocessing (deskewing, despeckling, thresholding, and some more). I don't think this part can be improved much further.
Usually the text to recognize is between one and six words long. The language of the text is unknown, and quite often it contains fantasy words.
I am on Linux. Preferred language would be Python.
What I have so far
Often each result has one or two errors, but at different characters/positions. An error can be a wrongly recognized character or an inserted character that does not exist in the text. Less often, a character is dropped entirely.
An example might look in the following way:
Xorem_ipsum
lorXYm_ipsum
lorem_ipuX
An X is a wrongly recognized character and a Y is a character that does not exist in the text. Spaces are replaced by "_" for better readability.
In cases like this I try to combine the different results.
By repeatedly applying the "longest common substring" algorithm between the three pairs, I am able to get the following structure for the given example:
or m_ipsum
lor m_ip u
orem_ip u
But here I am stuck. I am not able to combine those pieces into a result.
The questions
Do you have an idea how to combine the different longest common substrings?
Or do you have a better idea how to solve this problem?
The quality of results you can expect depends entirely on the OCR engines you are using. You may find that choosing a higher-quality OCR engine that gives you confidence levels and bounding boxes would produce much better raw results in the first place, plus extra information that could be used to determine the correct result.
Using Linux will restrict the OCR engines available to you. Personally, I would rate Tesseract as 6.5/10 compared to the commercial OCR engines available under Windows.
http://www.abbyy.com/ocr_sdk_linux/overview/ - The SDK may not be cheap though.
http://irislinktest.iriscorporate.com/c2-1637-189/iDRS-14-------Recognition--Image-preprocessing--Document-formatting-and-more.aspx - Available for Linux
http://www.rerecognition.com/ - Is available as a Linux version. This engine is used by many other companies.
All of the engines above should give you confidence levels, bounding boxes and better results than Tesseract OCR.
https://launchpad.net/cuneiform-linux - Cuneiform, now open-sourced and running under Linux. This is likely one of the three engines you are using. If not, you should probably look at adding it.
Also you may want to look at http://tev.fbk.eu/OCR/Products.html for more options.
Can you post a sample or two of typical images and the OCR results from the engines? There are other ways to improve OCR recognition, but it depends on the images.
Maybe repeat the "longest common substring" until all results are the same.
For your example, you would get the following in the next step:
or m_ip u
or m_ip u
or m_ip u
OR run the "longest common substring" algorithm on the first and second strings, and then again on the result and the third string. That way you get the same result, or m_ip u, more easily.
So you can assume those letters are correct. Now look at the gaps. Before or there is l twice and X once, so choose l. Between or and m_ip there is e twice and XY once, so choose e. And so on.
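This combine-and-vote scheme can be sketched in Python using difflib, whose matching blocks give a common subsequence, close enough to the repeated longest-common-substring step described here:

```python
from collections import Counter
from difflib import SequenceMatcher

def skeleton(strings):
    """Fold the strings with difflib's matching blocks to get characters
    common to all of them, in order (the 'or m_ip u' skeleton)."""
    base = strings[0]
    for s in strings[1:]:
        sm = SequenceMatcher(None, base, s)
        base = "".join(base[i:i + n] for i, j, n in sm.get_matching_blocks())
    return base

def vote_fill(strings):
    """Anchor each string on the common skeleton, then majority-vote the
    text each variant puts between consecutive anchor characters."""
    sk = skeleton(strings)
    gaps = [[] for _ in range(len(sk) + 1)]   # gap before, between, after anchors
    for s in strings:
        pos, pieces = 0, []
        for ch in sk:
            nxt = s.index(ch, pos)            # anchors occur in order in every string
            pieces.append(s[pos:nxt])
            pos = nxt + 1
        pieces.append(s[pos:])
        for g, piece in zip(gaps, pieces):
            g.append(piece)
    out = []
    for i, g in enumerate(gaps):
        out.append(Counter(g).most_common(1)[0][0])   # 2-of-3 vote per gap
        if i < len(sk):
            out.append(sk[i])
    return "".join(out)

print(vote_fill(["Xorem_ipsum", "lorXYm_ipsum", "lorem_ipuX"]))  # lorem_ipsum
```

With only three voters, ties can still occur in a gap; a confidence score from the OCR engine (if available) would be the natural tie-breaker.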
I'm new to OCR, but so far I have found that these systems are built to work from a dictionary of words rather than letter by letter. So if your images don't contain real words, you may have to look more closely at the letter-recognition and training parts of the systems you are using.
I faced a very similar problem.
I hope that this can help: http://dl.tufts.edu/catalog/tufts:PB.001.011.00001
See also software developed by Bruce Robertson: https://github.com/brobertson/rigaudon
