Finding a line in an image file - image

I am trying to crop out the printer marks that are at the edges of a PDF.
The path i want to take to solve this problem is as follows:
Convert PDF into a bitmap and then traverse the bitmap and try to find the lines, then once the lines are found, find the coordinates of the edges of these lines, set the cropping coordinates to the coordinates just found.
However the problems that pop up in my mind with this approach is how to know when the lines end and the actual page starts. How to differentiate lines from letters.
How do I overcome these hurdles, or is there a better way to crop out the printer marks from a PDF?

There is no general answer that works for ALL PDF files, however there are a few useful strategies that are implemented by existing solutions for graphics arts such as callas pdfToolbox (watch it, I'm associated with this product) or PitStop. The strategies center around a number of facts:
trim and bleed marks are usually simple lines (though thin rectangles are sometimes used as well). They are short and straight (horizontal or vertical).
These marks are usually drawn in specific colours. Either CMYK with the color set to 100%, 100%, 100%, 100% or - more commonly - a special spot color called "All". You're almost guaranteed of this because these marks need to show up on every printed separation (sorry for the technical printing terms if you're not familiar with them).
These marks normally are mirrored symmetrically. You're not looking for a single mark - you're looking for a set of them and this typically helps with recognition a lot. Watch out however that you're not confused by bad applications which don't place marks with absolute accuracy.
Lastly but perhaps not important in your application, different regions can actually work with different types of marks. Japanese trim and bleed marks for example look completely different than European or US marks.

Related

anyway to remove algorithmically discolorations from aerial imagery

I don't know much about image processing so please bear with me if this is not possible to implement.
I have several sets of aerial images of the same area originating from different sources. The pictures have been taken during different seasons, under different lighting conditions etc. Unfortunately some images look patchy and suffer from discolorations or are partially obstructed by clouds or pix-elated, as par example picture1 and picture2
I would like to take as an input several images of the same area and (by some kind of averaging them) produce 1 picture of improved quality. I know some C/C++ so I could use some image processing library.
Can anybody propose any image processing algorithm to achieve it or knows any research done in this field?
I would try with a "color twist" transform, i.e. a 3x3 matrix applied to the RGB components. To implement it, you need to pick color samples in areas that are split by a border, on both sides. You should fing three significantly different reference colors (hence six samples). This will allow you to write the nine linear equations to determine the matrix coefficients.
Then you will correct the altered areas by means of this color twist. As the geometry of these areas is intertwined with the field patches, I don't see a better way than contouring the regions by hand.
In the case of the second picture, the limits of the regions are blurred so that you will need to blur the region mask as well and perform blending.
In any case, don't expect a perfect repair of those problems as the transform might be nonlinear, and completely erasing the edges will be difficult. I also think that colors are so washed out at places that restoring them might create ugly artifacts.
For the sake of illustration, a quick attempt with PhotoShop using manual HLS adjustment (less powerful than color twist).
The first thing I thought of was a kernel matrix of sorts.
Do a first pass of the photo and use an edge detection algorithm to determine the borders between the photos - this should be fairly trivial, however you will need to eliminate any overlap/fading (looks like there's a bit in picture 2), you'll see why in a minute.
Do a second pass right along each border you've detected, and assume that the pixel on either side of the border should be the same color. Determine the difference between the red, green and blue values and average them along the entire length of the line, then divide it by two. The image with the lower red, green or blue value gets this new value added. The one with the higher red, green or blue value gets this value subtracted.
On either side of this line, every pixel should now be the exact same. You can remove one of these rows if you'd like, but if the lines don't run the length of the image this could cause size issues, and the line will likely not be very noticeable.
This could be made far more complicated by generating a filter by passing along this line - I'll leave that to you.
The issue with this could be where there was development/ fall colors etc, this might mess with your algorithm, but there's only one way to find out!

How do font ID algorithms work?

I was wondering how automatic font identification services (like WhatTheFont, not question-based ones like Identifont) work. The most basic variant would be a service that lets you upload an image that contains text, and the service returns the name of the font used. How is this done, and how is it done so fast as to be practical? I'm fairly new to this kind of thing, but here's my understanding so far:
Perhaps some pre-processing to reduce noise. I'm not particularly interested in this part.
First the image is run through an OCR to extract the text – simple enough.
Then you go through every font in the tens/hundreds-of-thousands in your database and render the text you have extracted in each one, seeing if it's close to the original. Adjusting for size, alignment, kerning, different weights or italics, etc. How is this possibly fast enough to be practical?
Is this correct?
Please offer some insight into how this is done, and how it's done efficiently.
Let us assume you are doing the match in the raster representation (not on vectorized outlines).
Indeed, the text should be recognized first to reduce the number of comparisons with the characters in the reference fonts; at this stage it matters to avoid any dubious recognition as this would wreak havoc.
Then a stage of normalization is needed: you can transform the character position, size (and possibly italics angle ?) to a standard bounding box so that pixel-by-pixel comparison becomes possible. Then the amount of computation will be proportional to the area of the characters times the number of reference fonts.
Beware that normalization in size is not fully accurate as big character shrunk will differ from a smaller character in the same font in a few details and stroke thickness. It is probably useful to consider two or three representative sizes per font.

Restoring an old manuscript with image processing

Say i have this old manuscript ..What am trying to do is making the manuscript such that all the characters present in it can be perfectly recognized what are the things i should keep in mind ?
While approaching such a problem any methods for the same?
Please help thank you
Some graphics applications have macro recorders (e.g. Paint Shop Pro). They can record a sequence of operations applied to an image and store them as macro script. You can then run the macro in a batch process, in order to process all the images contained in a folder automatically. This might be a better option, than re-inventing the wheel.
I would start by playing around with the different functions manually, in order to see what they do to your image. There are an awful number of things you can try: Sharpening, smoothing and remove noise with a lot of different methods and options. You can work on the contrast in many different ways (stretch, gamma correction, expand, and so on).
In addition, if your image has a yellowish background, then working on the red or green channel alone would probably lead to better results, because then the blue channel has a bad contrast.
Do you mean that you want to make it easier for people to read the characters, or are you trying to improve image quality so that optical character recognition (OCR) software can read them?
I'd recommend that you select a specific goal for readability. For example, you might want readers to be able to read the text 20% faster if the image has been processed. If you're using OCR software to read the text, set a read rate you'd like to achieve. Having a concrete goal makes it easier to keep track of your progress.
The image processing book Digital Image Processing by Gonzalez and Woods (3rd edition) has a nice example showing how to convert an image like this to a black-on-white representation. Once you have black text on a white background, you can perform a few additional image processing steps to "clean up" the image and make it a little more readable.
Sample steps:
Convert the image to black and white (grayscale)
Apply a moving average threshold to the image. If the characters are usually about the same size in an image, then you shouldn't have much trouble selecting values for the two parameters of the moving average threshold algorithm.
Once the image has been converted to just black characters on a white background, try simple operations such as morphological "close" to fill in small gaps.
Present the original image and the cleaned image to adult readers, and time how long it takes for them to read each sample. This will give you some indication of the improvement in image quality.
A technique call Stroke Width Transform has been discussed on SO previously. It can be used to extract character strokes from even very complex backgrounds. The SWT would be harder to implement, but could work for quite a wide variety of images:
Stroke Width Transform (SWT) implementation (Java, C#...)
The texture in the paper could present a problem for many algorithms. However, there are technique for denoising images based on the Fast Fourier Transform (FFT), an algorithm that you can use to find 1D or 2D sinusoidal patterns in an image (e.g. grid patterns). About halfway down the following page you can see examples of FFT-based techniques for removing periodic noise:
http://www.fmwconcepts.com/misc_tests/FFT_tests/index.html
If you find a technique that works for the images you're testing, I'm sure a number of people would be interested to see the unprocessed and processed images.

Detecting empty pages in scanned documents

So we need to detect whether an image, created by a scanner, represents an empty page. I'm way out of my depth when it comes to image processing, so I have to run this by the community.
Here's what I have come up with so far:
Empty pages can be glaringly white, gray recycled paper, or yellowed old paper. The current idea is to create a histogram for a page, look for a steep increase of the curve, and get the percentage of pixels are darker than that. If that exceeds a threshold, the page is likely not empty.
Since this would likely classify a page containing a single line of text at the top as empty, we would tile the page and gather statistics about each tile.
We would need to detect scanned staplers and holes from binding (likely only in certain tiles), but this can be put off to some later stage. However, if you have an idea what to look out for besides these two, please mention it in a comment.
This needs to be fast. It's part of a document processing workflow that processes (tens of) thousands of pages per day. If processing a page takes ten seconds longer, than our customers will have to tell their customers that they'll have to wait several days longer for their results. (If this results in more false positives, some customers would rather have someone check a few dozen found "empty" pages, than have their customer wait one more day.)
So here's my questions:
Is it a good idea to take this route or is there something better?
If we do it this way, how would I do this? What's a good (cheap) algorithm for finding a threshold for a page? Could we gain significant speed by assuming a similar threshold for a batch of documents? To which precision could brightness values be rounded, before getting logged? What quirks could we expect?
If you know that a scanned page is going to fill the image entirely, then calculating the standard deviation might be a good way of doing this.
I would suggest blurring page slightly to reduce some noise. Then calculate the SD for the page, in theory, a page the is more or less all one colour will have a low SD and one with lots of text will have a higher SD. Then it's a case of 'training' the system to work out when a page is plain and when it is text. You might find that certain pages are hard for it to tell.
You could have it trained by having it process a vast number of pages, and it goes through them all, and you say if it is plain or not.
EDIT
ok, a white page with black text, if we have just the page and no surrounding stuff, will have a mean colour of grey, probably a fairly light grey. Getting the average is a for loop through all the pixels, adding their values and then dividing by the number of pixels. I'm not good with this o(logN) stuff, but suffice to say, it will not that long. Unless you have HUGE images.
SD is a second for loop, this time we are counting up how different each pixel is from the mean, and then dividing by the mean. This will take a bit longer then the mean, as we have to do something like
diff = thispixel - mean;
if(diff < 0) {
diff = -diff;
}
runningTotal += diff;
For a plain coloured page, each pixel will be close to the mean value, thus our SD will be low. If the SD is below a certain value, we can assume that this means the page is all one colour.
This might have problems if their is very minimal amount of text, as it will not have a large influence on the SD, so maybe like you suggested in the question, break the page into sections. I suggest strips horizontally, as text tends to go this way. If we do one of these strips one at a time, once one strip suggests it has text, we can stop as we don't care if the rest is blank or not.
Blurring the page will help reduce noise, as the odd pixel of noise will be reduced in its impact, thus give you a 'tighter' SD. You could also use it to reduce the resolution of your image.
Say you sauce image is 300 wide by 900 high, you could sample pixels in blocks of nine, 3 *3, and thus end up with an image that is 100 wide by 300 high, so it can actually be used to reduce the amount of calculations you need to do, in this case by a ninth!
The main problem is going to be in working out how high an SD can be with just a plain page. Maybe have it find the SD of a load of blank pages.
By the sounds of it, you are probably going to want to have a middle ground that lets it be unsure and ask for human intervention, possibly letting the human value train the system to get better?
Perform some sort of simple edge detection. If the number of pixels constituting edges is below some threshold, then there's going to be a high probability the page is empty. This could be improved by classifying certain edges that correspond with high certainty (by shape and location) to punched holes and staples as trivial and discounting them from the metric.
When I worked for a document processor (~8 years ago), we handled client projects varying from very "clean" only-US-letter-sized pages to cover-/cardstock of irregular shapes mixed with normal pages. Operators fed pre-sorted files into scanning machines and only had to watch for folded corners and similar mechanical problems. Their output was multiple streams of hundreds of images corresponding to a range of files. A single scanner operator could easily scan 15k pieces of paper in a shift (that's only 0.60 pages/sec, while a scanner at speed could handle 3 pages/sec and still scan both sides). Later operators processed those looking for key pages to mark file start and end. (Image recognition can be used here, sometimes, but people also provide a quality check on the first operators.) We had many variables that could be set per client project.
I'm basing the rough outline below on that experience and how it appears that your goals and workflow are similar.
(Terminology: By client I mean our client, e.g. a specific bank. A project or client project is a set of documents from that client that contains many files, e.g. all mortgages handled by a specific branch in a given year. A file is a logical arrangement that would normally be a physical file folder for one of the client's customers, e.g. all mortgage papers for one address.)
Cut off the top, bottom, sides, and corners. Throw these out of your calculations (even though you'll probably store them in the final image). This will cover staple holes, binder holes, but also (tiny) folded corners and very minute torn edges which appear as black spots. Depending on how you're scanning, the last two may be less of a problem.
Vary the sizes of these cuts for each client project, as required. For example, even a very thin edge slice, say 1-2mm, will eliminate most ragged edges without increasing false positive rate.
Convert to black and white, 1 bit per pixel. I suspect you are already doing this for some client projects anyway, so doing this efficiently and effectively, which can be subtle, should be no extra work. (Even if you don't store the 1bpp image as the deliverable result, the conversion will be helpful in empty page detection.) Eliminate noise by dropping any black pixels with none or only one black neighbor (using all surrounding 8 as neighbors).
After cutting extremities (#1) and this simplistic noise reduction, blank pages will have a very low number of black pixels; most blanks will have no black pixels at all – exempting exceptionally poor page quality, inked stamps (when scanning back-sides, mentioned more below), or other circumstances across the whole project, and so forth.
Depending on client project, you may set hotspots to be watched – the converse of cutting off the sides. For example, watching a 1" strip where a single line at the top of the page would appear may reduce false positives. A low contrast scan or faded hardcopy (perhaps even pencil, which can be common on back-sides) with only one line of text will be distinguished from a blank page this way.
What sections are worth watching depends on each project, but do not try to divide the page up into tiles and then subdivide those tiles into areas of interest. Instead, parallelize this on the page level; e.g. 1 worker per core, each worker handles a full page at a time.
Depending on how you're keying individual files, you may find it helpful to drop blanks (before marking start-of-file pages, which is still often a manual process even at high volume) then watch for blank pages at unexpected points after files have been keyed (e.g. expected would be the last page of the file, without being two blanks in a row, etc.).
For example, if a particular project is only scanning one side of each page, then detecting two blank pages in a row is a good indication that a couple pages in a file were flipped upside-down (clients often hand over hardcopy files like this). Either the sorters (who remove things like staples and paperclips) or the first machine operators should have caught this mistake, but, regardless, it will now need a manual check to verify.
On the other hand, there were projects that had very clean files so sorters could insert (usually colored) blank pages marking file boundaries. In this case, the second set of people still did the keying by file number, but only had to examine the first page of each file. This wasn't rare, but not common either.
Before I start rambling a bit, I hope my main point comes across: you have to decide how to mitigate rates of false positives (= data loss) and false negatives (= annoying blanks and otherwise harmless, but a maximum allowed rate may still be specified in the project contract). That varies drastically by project and the type of files/documents you're handling, but it guides you in how to do the detection. You will get much better results from a tailored approach than trying one-size-fits-all, even if the tailored approaches are 80-98% similar.
If you're delivering 1bpp images to the client, for example, you might not even want/need to eliminate blanks as filesize (and ultimately size of the delivered dataset) won't be an issue. This can be an acceptable trade-off when eliminating most blanks is harder while maintaining a low false positive rate; such as for files with inked stamps ("received on", "accepted", "due date", etc.; they bleed through to the back) or other problems, for example.
My fall class does a bunch of image-processing projects.
Here's what I would try:
Project from color to grayscale
Pour all the pixels into a simple histogram with say 100 buckets between 0 and 1
Find a local minimum in the histogram such that the absoluete value of above - below is as small as possible, where above is the number of brighter pixels and below is the number of darker pixels
Force the above pixels to white and the below pixels to black
If you like, as an extra step you could remove black edges
If there are hardly any black pixels, the page is blank
The first two steps should be combined, and they are the only time-consuming steps; on a 600dpi images you may have to touch many millions of pixels. The rest will be lightning fast. I'd be very surprised if you can't classify multiple images per second—especially if you know there will be no black edges.
The only part that requires training or experiment is the last step. It's also possible that you will need to fiddle around with the number of buckets in the histogram; if there are too many buckets, you may have a bad local minimum.
Good luck, and report back to us how you make out!
Check out this line detection algorithm: http://homepages.inf.ed.ac.uk/rbf/HIPR2/linedet.htm. In addition to a detailed explanation of how the algo works there's a demo where you can use your own image and see the results. I tried two images: 1) a B&W scan of a receipt, 2) the B&W, "blank" back side of that same receipt. All of the edge detection algorithms I tried found edges on the "blank" page. But, this line detection algorithm was the only algorithm that correctly found lines on the front page and yet didn't find anything on the "blank" back page.
It looks as if you're trying to convert all paperwork for a company into digital documents. Some of this paper can be really old.
Say your text is black, and any other color is the background. If you take two weighted averages, one consisting of what you think is the text, and one consisting of the background, you can compare those two and see if they're distant enough to consider further evaluation. This will removing any uneven aging of the paper.
Staple holes and punched holes in paper are pretty standard in size, but they'd show up as gray or not at all if you're scanning on a white background. If not, then you can guess where these are and remove them.
Now, we look at areas of high interest, areas where the black pixels are the most dense. Select a portion of that and OCR it. Place the starting top-left closest to an area where text begins. On a typical document, a solid blank linear area going left-to-right and another going top-to-bottom denotes the top and left sides of a paragraph. You can be sure that you got a line of text because below a line of text is another blank left-to-right area. So you don't need to worry about selecting a portion that will chop text in half.
You could take the mean gray level (integer) of each few rows of the scanned image (depending on the resolution and how many lines are required to capture one line of text), then consider the spread of row means. If there is no text on the page, the spread of means should be small (i.e. background ranges from 250-255), and if there is text on the whole page or on part of the page, the spread would be much larger (i.e. 15 for text to 250 for background).
Seems to me like the solution should be computationally simple due to the large number of pages to check. Approaches requiring further processing (edge detection, filtering, etc) seem like overkill, and will take much longer to run.
There is no need to process pixel by pixel, using matrices will help this be more efficient, for example using Numpy you can calculate means, sums, etc. for entire rows, columns or matrices at once much more efficiently. There is also no need to process EVERY pixel, a good sample of rows should be able to accomplish the task with similar accuracy. 8bit accuracy should be fine, and you could even resample to large pixels before running this processing algorithm.
You can do a noisy trim, i.e. blur the image and do an auto-trim (without actually modifying the image). If the width or height of the trim result is below a threshold (e.g. 80 to 100 for a 600 dpi image) then the page is empty.
A proof of concept using the ImageMagick command line front-end:
$ convert scan.png -shave 300x0 -virtual-pixel White -blur 0x15 -fuzz 15% \
-trim info:
The above command assumes a 600 dpi DIN A4 black and white (1 Bit) image. It also ignores a margin of 300 pixels such that artifacts like perforation holes don't yield false negatives.

Display of colorblind images

Quick question: At this website here http://www.vischeck.com/examples/ there are a few pictures of numbers hidden within another color to test for color blindness. Is there any way that these images can be generated algorithmically?
They are based on ready-made dot fields, and you overlay a number on them, and do the coloring on the whole dot each time it is partially filled. If You know the correct colors - that will do ;)
What language are you coding in? It's impossible to give any definitive answer without knowing your problem well.
If you're in .NET, GDI is your best bet for generating such a dot field, but it is not simple to do algorithmically, and it's possible that these were hand-drawn.
One easier possibility you have is to use an evenly-spaced circles field, even if it's not as elegant.
Then, you'd pick two colors that aren't supposed to be (easily) distinguishable by (certain?) color-blind people.
Now, you draw a number in the square field (using one of 10 matrices for the numbers 0-9 that represent each number character with sizes compatible with the circle field) using (limited) random variations of the two colors that the person shouldn't distinguish.
In other words, if the person isn't supposed to distinguish red and green, you'd make a character using shades of red on top of a shades of green background.
You'd possibly need HUE>RGB functions, for .NET you'd have to look for a library (I remember using one from codeproject).

Resources