Windows font rendering

I am writing an application which takes as input the data stream going to the printer (from Redmon). The data stream contains text rendered as a monochrome bitmap, which the printer uses to print it on paper. I plan to parse this data stream and recover the text going to the printer. My application would parse the data coming from any Windows application on its way to the printer.
I parse the data stream by matching the pixel information byte by byte; if there is an exact match, I can uniquely identify a character. For this I am assuming that all Windows applications use the same Windows renderer to rasterize a font into pixels, so I would always get the same sequence of bytes for a particular character from any application (including Java-based ones), as long as they use the same font and font size for printing. Is this a correct assumption, or does Windows give applications various options for rendering text for printing?
Also, is there a library I can use for character recognition on monochrome bitmap data?
NOTE: The printers I am using are ESC/POS compatible. The printer driver for these printers sends the data to be printed as a monochrome bitmap.

I'm not familiar with ESC/POS printers, but if you can guarantee the driver always renders text as monochrome bitmaps, the chances of the bitmaps being identical for the same font and size are very high, but not 100%. First, you also need to account for rotation, scaling and shearing: you would need to consider the entire transformation matrix, not just the font size.
There are at least two other failure points I can think of: 1) text overlaid with transparencies, and 2) machines with alternate fonts installed under the same names. For example, common fonts like Helvetica can be obtained from many sources, and the characters will not be identical between them. A third possible failure is an application that ignores the fact that the printer is monochrome and prints in color or grayscale; converting color or grayscale to monochrome will produce different bitmaps for different colors.
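The byte-by-byte matching the question describes amounts to an exact-match lookup keyed on a glyph's raw bitmap bytes. A minimal sketch of the idea (the table contents and names here are hypothetical; a real table would be built by capturing the driver's output for a known font):

```python
# Sketch: identify characters by exact-match lookup of monochrome glyph
# bitmaps. The glyph_table entries are made-up 4-row, 8-pixel-wide glyphs.
glyph_table = {
    bytes([0b00011000, 0b00100100, 0b01111110, 0b01000010]): "A",
    bytes([0b01111100, 0b01000010, 0b01111100, 0b01000010]): "B",
}

def identify_glyph(bitmap):
    # Exact match only: any change in font, size, or the transformation
    # matrix yields different bytes and the lookup fails.
    return glyph_table.get(bitmap)
```

This fragility is exactly why the caveats above matter: a single substitute font or a different transformation breaks the exact match.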
As for OCR software, Wikipedia has a nice comparison chart of OCR SDKs.

Read encoded files

I was trying to read some files, like images, but when I open them with Notepad I see weird codes like this:
ÿH‹\$0H‹t$8HƒÄ _ÃÌÌÌÌÌÌH‰\$H‰l$H‰t$ WAVAWHƒì ·L
So I have the following questions:
Why do I find those weird symbols instead of zeros and ones?
Do programmers do this for security or optimization?
Is this an encoding such as ASCII, where every symbol has a unique decimal and binary number associated with it?
Can anyone with the corresponding decoder read this information?
Thank you
Most data files like images are binary. If you know the format of the file, you can use a hexadecimal editor (I use HexEdit) to view the bytes as hexadecimal.
A colour is often stored as RGB (Red, Green, Blue), so for instance, this is a dark red:
80 00 00 // (there are no spaces in the real file format, but hex editors add them.)
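You can decode those three bytes yourself; a quick sketch using the dark-red example above:

```python
# The "dark red" above: one byte per channel, values 0-255.
pixel = bytes.fromhex("800000")
r, g, b = pixel
print(r, g, b)  # 128 0 0: strong red, no green, no blue
```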
The format of an image depends on how it's stored. Most image formats have ways of encoding the difference between pixels rather than the actual pixels themselves, because there's a lot of information redundancy between the different pixels.
For instance, if I have a picture of the night sky with a focus on the moon, there's probably a big area in one corner that's all much the same shade of grey; encoding that without optimization would mean a hell of a lot of file that just read:
9080b09080b09080b09080b09080b09080b09080b59080b59080b5...
In this case, the grey is slightly bluish-purple, tending towards a brighter blue at the end. I've stored it as RGB here (R:90, G:80, B:b0), but there are other formats for that storage too.
Instead of listing every pixel, I could equally say "6 lots of bluish-grey, then it gets brighter in blue":
=6x9080b0+3x000005+...
This reduces the amount of information I would need to transmit. Most optimizations aren't quite that human-readable, but they operate on similar lines (this is a general information principle used in all kinds of things like .zip files too, not just images).
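A run-length scheme like the "6x9080b0" notation above can be sketched in a few lines (a hypothetical helper, assuming tightly packed 3-byte RGB pixels):

```python
# Sketch: run-length encode runs of identical 3-byte RGB pixels
# into (count, pixel) pairs.
def rle_encode(pixels):
    runs = []
    i = 0
    while i < len(pixels):
        px = pixels[i:i+3]
        n = 1
        # Count how many consecutive pixels repeat the same 3 bytes.
        while pixels[i+3*n:i+3*n+3] == px:
            n += 1
        runs.append((n, px))
        i += 3 * n
    return runs

# Six bluish-grey pixels followed by three slightly brighter ones:
data = bytes.fromhex("9080b0" * 6 + "9080b5" * 3)
print(rle_encode(data))  # [(6, b'\x90\x80\xb0'), (3, b'\x90\x80\xb5')]
```

Real codecs use far more sophisticated schemes, but this captures the basic idea of exploiting redundancy between neighbouring pixels.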
Note that this is still a lossless format; I could always get back to the actual pixel-perfect image. Bitmaps (.bmp) are lossless (though obviously still digital; they will never capture everything a human sees).
A number of formats encode the information in the frequency domain instead. It's a bit like looking at the waveform of music, except it's two-dimensional. Depending on the sampling frequency, information can easily be lost here (and often is). JPEGs (.jpg) use lossy compression like this.
The reason you see ASCII characters is that some of the byte values just happen to coincide with ASCII text codes. It's pure coincidence; Notepad is doing its best to interpret what is, to it, essentially gibberish. For instance, this colour sequence:
4e4f424f4459
happens to coincide with the letters "NOBODY", but it also represents two pixels next to each other. Both are dark greyish colours, with the right-hand one (R:4f, G:44, B:59) being a bit more blue than the left (R:4e, G:4f, B:42).
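You can verify the coincidence directly (a Python sketch using the six bytes above):

```python
# Six bytes that are simultaneously valid ASCII text and two RGB pixels.
raw = bytes.fromhex("4e4f424f4459")
print(raw.decode("ascii"))        # NOBODY
left, right = raw[:3], raw[3:]
print(tuple(left), tuple(right))  # (78, 79, 66) (79, 68, 89)
```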
But that's only if your format is storing raw pixel information... which is expensive, so it probably isn't the case.
Image formats are a pretty specialist area. The famous XKCD cartoon "Digital Data" showcases the optimizations being made in some of them. This is why, generally speaking, you shouldn't use JPEG for text, but use something like PNG (.png) instead.

Ghostscript command for finding the number of colors used for each page in a PDF file

I'm new to Ghostscript. Can you let me know the Ghostscript command for finding the number of colors used for each page in a PDF file? I need to parse the results of this command from a Java program.
There is no such Ghostscript command or device. It would also be difficult to define, because so much depends on what you mean. Do you intend to count the colour of each pixel in every image, for example? Which colour spaces are you interested in? For ICCBased colour spaces, do you want the component values or the CIE values?
[edit]
Yeah, there's no Ghostscript equivalent; I did say that.
You would have to intercept every call to the colour operators, examine the components being supplied, and see whether they were not black and white. For example, if you set a CMYK colour with C=M=Y=0 and K!=0 then it's still black and white. Similar arguments apply for RGB, CIE and ICC colour spaces.
Now, I bet ImageMagick doesn't do that; I suspect it simply uses Ghostscript to render a bitmap (probably RGB) and then counts the number of pixels of each colour in the output. Image manipulation tools pretty much all have a way to do that counting already, so it's low cost for them.
It's also wrong.
It doesn't tell you anything about the original colour. If you render a colour object to a colour space different from the one it was specified in, the rendering engine has to convert it from the colour space it was in to the expected one. This often leads to colour shifts, especially when converting from RGB to CMYK, but any conversion can potentially have this problem.
So if this is what ImageMagick is doing, it's inaccurate at best. It is possible to write PostScript to do this accurately, with some effort, but exactly what counts as 'colour' versus 'black and white' is still a problem. You haven't said why you want to know whether an input file is 'black and white' (you also haven't said whether grey counts as black and white; it's not the same thing).
I'm guessing you intend either to charge more for colour printing or to divert colour input to a different printer. In that case you do need to know whether the PDF uses (e.g.) R=G=B=0 for black, because that often will not result in C=M=Y=0 K=1 when rendered to the printer. Not only that, but the exact colour produced may not even be the same from one printer to another (colour conversion is device-dependent), so just because Ghostscript produced pure black doesn't mean that another printer would.
This is not a simple subject.

Speeding up postscript image print

I am developing an application that prints an image by generating PostScript output and sending it to the printer. So I convert my image to JPEG, then to an ASCII85 string, append this data to the PostScript file and send it to the printer.
Output looks like:
%!
{/DeviceRGB setcolorspace
/T currentfile/ASCII85Decode filter def
/F T/DCTDecode filter def
<</ImageType 1/Width 3600/Height 2400/BitsPerComponent
8/ImageMatrix[7.809 0 0 -8.053 0 2400]/Decode
[0 1 0 1 0 1]/DataSource F>> image F closefile T closefile}
exec
s4IA0!"_al8O`[\!<E1.!+5d,s5<tI7<iNY!!#_f!%IsK!!iQ0!?(qA!!!!"!!!".!?2"B!!!!"!!!
---------------------------------------------------------------
ASCII85 data
---------------------------------------------------------------
bSKs4I~>
showpage
My goal now is to speed this up. Currently it takes about 14 seconds from sending the .ps file to the printer to the moment the printer actually starts printing the page (for a 2 MB file).
Why is it so slow?
Maybe I can reformat the image so the printer doesn't need to perform an affine transform of the image?
Maybe I can use a better image encoding?
Any tutorials, clues or advice would be valuable.
One reason it's slow is that JPEG (DCT) is an expensive compression filter to decode. Try using Flate instead. Don't ASCII85-encode the image; send it as binary. That reduces transmission time and removes another filter. Note that JPEG is lossy compression, so by 'converting to JPEG' you are also sacrificing quality.
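Flate is the zlib/deflate algorithm, so its behaviour on flat raster data is easy to demonstrate (a sketch; the raster here is synthetic, not the questioner's image):

```python
import zlib

# Flate (zlib/deflate) is lossless, unlike DCT/JPEG, and cheap to decode.
raster = bytes([0x90, 0x80, 0xB0]) * 10000   # 30000 bytes of flat colour
compressed = zlib.compress(raster, 9)
assert zlib.decompress(compressed) == raster  # lossless round trip
print(len(raster), "->", len(compressed))
```

On real photographic data Flate compresses less well than JPEG, but it avoids both the quality loss and the decode cost on the printer.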
You can reduce the amount of effort the printer goes to by creating/scaling the image (before creating the PostScript) so that each image sample matches one pixel in device space. On the other hand, if you are scaling an image up, this means you will need to send more image data to the printer. But usually these days the data connection is fast.
However, this is usually hard to do and often defeated by the fact that the printer may not be able to print to the edge of the media, and so may scale the marking operations by a small amount so that the content fits on the printable area. It's usually pretty hard to figure out whether that's going on.
Your ImageMatrix is, well, odd. It isn't a 1:1 scaling, and floating-point scale factors are really going to slow down the mapping from user space to device space. And you have a lot of samples to map.
You could also map the image samples into PostScript device space (so that the origin is at the bottom left, at 0,0, instead of the top left), which would mean you wouldn't have to flip the CTM in the y axis.
But in short, trying to play with the scale factors is probably not worth it, and most printers optimise these transformations anyway.
The colour model of the printer is usually CMYK, so by sending an RGB image you are forcing the printer to do a colour conversion on every sample in the image. For your 3600x2400 image, that's more than 8.6 million conversions.

Microsoft Word -> PDF image quality

Our line of business application uses a Word document as a template, fills in the pertinent information and converts it to PDF, which it returns to the user.
That all works fine except for one thing. We use an image of our company's logo on the lead page and in the footer. In one resolution (e.g. 100%), it looks fine. But at higher resolutions (e.g. 250%), it has several noticeable jaggies; the diagonals have noticeable ragged edges. Tweaking the image, we're able to make it look good at the higher zoom value, but then it looks terrible at lower zoom values.
Currently, we're using a PNG, but we've tried JPG and it doesn't improve the jaggy problem. In fact, it looks worse at higher resolution because of JPG compression. I think a vector image would solve the problem (and we have the logo in vector format), but I haven't found any vector formats that Word supports.
I don't really have any code to show, since we don't do anything with the image in the code: we just take the document and plug in our values, none of which touch the logo (the template already contains the image).
We are using Word 2013 (32-bit) on Windows 8.1 (though some of our developers use Windows 7). We use the .NET PdfDocument class to generate the PDF.
Any ideas on how to get Word to be better at retaining image quality? Or is this a PDF issue?
The suggestion by David van Driessche might still work, provided the right kind of EMF is used. EMF files can contain both raster and vector data; with a raster EMF, the same problem will present itself as it did with PNG or JPEG. A vector EMF embedded in a Word file scales very nicely, at least when zoomed on screen, so it should also work when printing or converting to PDF.
Word supports both raster and vector objects within EMFs, so the secret is to use EMFs that only contain scalable objects like lines, curves and text when quality & scaling are both concerns.

What are the minimum margins most printers can handle?

I'm creating PDFs server-side with lots of graphics, so maximizing real estate is a must, but ensuring that users' printers can handle the tight margins is just as important.
Does anyone have an idea what safe values I can use for the margins when authoring the PDFs? In the past I've used work and home printers with margins of about one centimetre with no problems, but of course I can't take this as the de facto minimum.
Oh, and I don't really want to allow the user to specify the margin (50% laziness, 50% it will get complicated).
I've googled but couldn't find anything concrete ("average minimum margin printing").
Every printer is different but 0.25" (6.35 mm) is a safe bet.
For every PostScript printer, part of its driver is an ASCII file called a PostScript Printer Description (PPD). PPDs are also used by the CUPS printing system on Linux and Mac OS X, even for non-PostScript printers.
Every PPD MUST, according to the PPD specification written by Adobe, contain a definition of *ImageableArea (that's a PPD keyword) for each and every media size the printer can handle. That value is given, for example, as *ImageableArea Folio/8,25x13: "12 12 583 923" for one printer in this office here, and *ImageableArea Folio/8,25x13: "0 0 595 935" for the one sitting in the next room.
These figures mean "lower left corner at (12|12), upper right corner at (583|923)", measured in points (72 pt == 1 inch). Can you see that the first printer prints with a margin of 1/6 inch? And that the second one can even print borderless?
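Pulling those numbers out of a PPD line is straightforward; a sketch (the line is the first example above):

```python
import re

# Parse the bounding box out of an *ImageableArea entry; values are points.
line = '*ImageableArea Folio/8,25x13: "12 12 583 923"'
llx, lly, urx, ury = map(float, re.search(r'"([^"]+)"', line).group(1).split())
print("left margin: %.4f inch" % (llx / 72))  # 72 pt == 1 inch
```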
What you need to know is this: Even if the printer can do very small margins physically, if the PPD *ImageableArea is set to a wider margin, the print data generated by the driver and sent to the printer will be clipped according to the PPD setting -- not by the printer itself.
These days more and more models appear on the market which can indeed print edge-to-edge. This is especially true for office laser printers. (Don't know about devices for the home use market.) Sometimes you have to enable that borderless mode with a separate switch in the driver settings, sometimes also on the device itself (front panel, or web interface).
Older models, HP's for example, define their margins quite generously in their PPDs, just to be on the supposedly "safe side". Very often HP used 1/3 inch, 1/2 inch or more (like "24 24 588 768" for Letter format). I remember having hacked HP PPDs and tuned them down to "6 6 606 786" (1/12 inch) before the physical boundaries of the device kicked in and enforced real clipping of the page image.
Now, PCL and other language printers are not that much different in their margin capabilities from PostScript models.
But of course, when it comes to printing PDF documents, you can nearly always choose "print to fit" or a similarly named option, even for a file that itself does not use any margins. For that "fit", the PDF viewer reads the *ImageableArea from the driver and scales the page down to it.
As a general rule of thumb, I use 1 cm margins when producing PDFs. I work in the geospatial industry and produce PDF maps that reference a specific geographic scale, so I do not have the option to 'fit document to printable area': that would make the reference scale inaccurate.
You must also realize that when you fit to the printable area, you are fitting your already existing margins inside the printer margins, so you end up with double margins. Make your margins the right size and your documents will print perfectly.
Many modern printers can print with margins of less than 3 mm, so 1 cm as a general rule should be sufficient. However, if it is a high-profile job, get the specs of the printer you will be printing with and ensure that your margins are adequate. All you need is the brand and model number, and you can find spec sheets through a Google search.
The margins vary depending on the printer. In Windows GDI, you call the following functions to get the built-in margins, the "no-print zone":
GetDeviceCaps(hdc, PHYSICALWIDTH);   // total paper width, in device units
GetDeviceCaps(hdc, PHYSICALHEIGHT);  // total paper height, in device units
GetDeviceCaps(hdc, PHYSICALOFFSETX); // left offset of the printable area
GetDeviceCaps(hdc, PHYSICALOFFSETY); // top offset of the printable area
Printing right to the edge is called a "bleed" in the printing industry. The only laser printer I ever knew to print right to the edge was the Xerox 9700: 120 ppm, $500K in 1980.
You shouldn't need to let the users specify the margin on your website; let them do it on their computer. Print dialogs usually (Adobe and Preview, at least) give you an option to scale and center the output on the printable area of the page:
(Screenshots: the scale-and-center options in the Adobe and Preview print dialogs.)
Of course, this assumes that you have computer literate users, which may or may not be the case.
