If I read some binary from a stream, save it in a text file and then rename it with the .jpg extension, how come the file won't open as an image?
As a reference, I have got the source image, opened it up in notepad and compared both files - side-by-side they have exactly the same content.
I'd guess that you didn't open your text file in binary mode. Some bytes will be changed when you write data in text mode (most notably the end-of-line byte sequence) and those changes will be ignored by Notepad because it thinks everything is text. Try using comp (I think that's the right command) to compare your files rather than Notepad.
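To make the text-mode point concrete, here is a minimal Python sketch (Python is my choice for illustration; the question does not say which language was used). The payload bytes and the helper name first_difference are hypothetical. Windows-style newline translation is forced on explicitly so the effect shows up on any OS, and the files are compared byte by byte, roughly as comp would do.

```python
import os
import tempfile


def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Hypothetical payload: a JPEG start-of-image marker followed by a bare 0x0A.
payload = b"\xff\xd8\xff\xe0\x0a\x00\x10"

with tempfile.TemporaryDirectory() as tmp:
    binary_path = os.path.join(tmp, "binary.jpg")
    text_path = os.path.join(tmp, "text.jpg")

    # Binary mode: the bytes reach the disk untouched.
    with open(binary_path, "wb") as f:
        f.write(payload)

    # Text mode with "\n" -> "\r\n" translation, as a Windows text stream does.
    with open(text_path, "w", encoding="latin-1", newline="\r\n") as f:
        f.write(payload.decode("latin-1"))

    with open(binary_path, "rb") as f:
        binary_bytes = f.read()
    with open(text_path, "rb") as f:
        text_bytes = f.read()

print(first_difference(payload, binary_bytes))  # None: identical
print(first_difference(payload, text_bytes))    # 4: an extra 0x0D was inserted
```

A single inserted byte is enough to shift everything after it, which a decoder will not tolerate even though a text editor shows nothing unusual.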
I am using Ghostscript to convert PDF to PDF/A by command line:
gs -dPDFA -dBATCH -dNOPAUSE -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile="output.pdf" input.pdf
But the output file is no longer searchable.
How can I get searchable PDF/A files as output?
Thanks.
You haven't supplied an input file to look at, nor mentioned which version of Ghostscript you are using.
Let me start with my standard lecture on this subject; when you take a PDF file as input, and use Ghostscript's pdfwrite device to produce a new PDF file, you are NOT 'converting', 'editing' or 'modifying' the input file.
What happens is that the PDF interpreter interprets the PDF file, and produces a series of graphics primitives, which it feeds to the graphics library. This then processes these primitives, and passes them to the device. The device then emits them to the output file. In the case of a rendering device (e.g. TIFF) it renders the operations to a bitmap and when it reaches the end of the file, it writes the bitmap as a file. In the case of pdfwrite, it re-assembles these primitives into a brand new PDF file.
So the output PDF file has nothing in common with the input PDF file, except its appearance.
There are disadvantages to this approach (it does limit us in preserving some non-printing aspects of the input file), but there are also advantages; for instance it permits us to alter colour spaces, flatten transparency, change font encodings etc.
In addition to this you have chosen to create a PDF/A file. PDF/A limits the available features of the PDF specification, and it may be (it's impossible to tell without seeing the original file) that it simply isn't possible to represent the original PDF file as a PDF/A file without altering some aspects of it.
Again, without seeing the original file I can't tell, but it may be that you simply cannot achieve what you want, or at least not using Ghostscript.
I heard there is a way to add hidden text inside the code of an image file (like jpg/png/gif).
If we open this image in Windows, a picture is shown, but if we open it in a text editor (like Notepad++), we see our hidden text.
What is this method called? What can you say about it?
Thanks.
Look up steganography. There are lots of tools to add any kind of hidden data you want in there. Usually, though, it's not readable in Notepad; you need the companion tool to the one you used to add the data in the first place. Using this you can even hide a binary file inside.
OR... you could look into using the metadata -- EXIF -- of the JPEG. Lots of tools exist to edit that data too. It ends up stored in the header of the file, so it should be right near the beginning, in other words the file would look something like:
JFIF ..... (GARBAGE) ..... Your Metadata ...... (GARBAGE)
Or finally, I hear that you can just concatenate a RAR onto the end of a JPEG and it will work as a (strangely huge) JPEG but WinRAR will notice the RAR contents when you open it in WinRAR.
This is called steganography.
I think its primary industrial use is watermarking content.
Information Hiding: Steganography & Digital Watermarking is a good resource on the topic.
Use "copy" to merge two files into one.
copy /B img.jpg + some.txt
Both files are merged into img.jpg: the text from some.txt is appended to the end of the img.jpg file.
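The same append-to-the-end idea can be sketched in Python. This is only an illustration under assumptions: the byte string below is a stand-in with JPEG start and end markers, not a real image, and the helper trailing_data is hypothetical. It looks for the last end-of-image marker (FF D9), so it would give a wrong answer if the appended data itself contained those two bytes.

```python
def trailing_data(jpeg_bytes: bytes) -> bytes:
    """Return whatever follows the last JPEG end-of-image marker (FF D9)."""
    end = jpeg_bytes.rfind(b"\xff\xd9")
    if end == -1:
        raise ValueError("no end-of-image marker found")
    return jpeg_bytes[end + 2:]


# Stand-in for image data: start-of-image marker, filler, end-of-image marker.
fake_jpeg = b"\xff\xd8" + b"\x00" * 8 + b"\xff\xd9"

# Equivalent of "copy /B img.jpg + some.txt": plain byte concatenation.
merged = fake_jpeg + b"hidden text"

print(trailing_data(merged))  # b'hidden text'
```

Viewers that stop decoding at the end-of-image marker ignore the extra bytes, which is why the picture still opens normally.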
I have a requirement where in I have to determine whether a photo is corrupted and accordingly tag it as such.
Another thing I need is to determine if an image has the wrong extension. What I mean by wrong extension is that sometimes I come across a photo that has the extension jpg, but when I load it into IrfanView it reports that the photo is in a different format than the extension suggests.
How can I do this in Delphi?
I have a requirement where in I have to determine whether a photo is corrupted and accordingly tag it as such.
You can try some things, but with certain file formats (example: BMP, JPEG to some extent) only a human can ultimately decide if the file is OK or corrupted. The simplest test is to simply load the file into a corresponding object (TJpegImage, TPngObject, etc). If you get an exception while loading you've surely got a corrupted file. Unfortunately if no exception is raised you can't really say the file is not corrupted. I've seen corrupted JPEG files that load just fine into a Delphi TImage and can be opened with Windows's Image Viewer, but are obviously corrupted to a human observer. With BMP images it's even clearer: open up a bitmap, overwrite some bytes in the middle of the file and then open it in a viewer. How can any automated system tell those wrongly colored bits in the middle of the bitmap are actually wrong?
Another thing I need is to determine if an image has the wrong extension. What I mean by wrong extension is that sometimes I come across a photo that has the extension jpg, but when I load it into IrfanView it reports that the photo is in a different format than the extension suggests.
How about doing some of the same: try to load the file into the object that corresponds to its extension, and if that fails, try opening it as some other formats? This should be easy.
Alternatively you can investigate image headers: most file formats start with a short signature, a few bytes long. You can look up the documentation of all image file formats and find the signature, or you can simply open up a large number of files and look for a pattern in the first 4 bytes. I'd go for this second alternative since finding proper documentation for all image file formats might be a challenge.
The only way to check if a file is corrupted is to try reading it as described by its file format, i.e. load a BMP as a BMP, reading the BMP header, the BMP data etc. There are many web pages that describe graphics file formats. Of course, if you transmit files and are afraid that they will be corrupted in transit, then store a checksum such as CRC32 alongside each file, or even a cryptographic hash such as MD5 or SHA1. After transmitting, check that the calculated sum is the same as the original.
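As a rough illustration of the checksum idea, here is a short Python sketch (not Delphi, which the question asks about; the function name checksums is my own). It computes a CRC32 and a SHA1 over the file contents so the values can be stored before transmission and compared afterwards.

```python
import hashlib
import zlib


def checksums(data: bytes) -> dict:
    """Compute a CRC32 and a SHA1 digest over the given bytes."""
    return {
        "crc32": zlib.crc32(data) & 0xFFFFFFFF,
        "sha1": hashlib.sha1(data).hexdigest(),
    }


original = b"\xff\xd8\xff\xe0 pretend this is image data"
received = bytearray(original)
received[5] ^= 0x01  # simulate a single flipped bit in transit

print(checksums(original) == checksums(bytes(original)))  # True
print(checksums(original) == checksums(bytes(received)))  # False
```

This only detects changes made after the sum was taken; it says nothing about whether the file was a valid image to begin with.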
In Delphi there is the jpeg unit, with the types TJPEGImage and TBitmap. Try loading the file into one of them and check for an exception. For other formats there are many libraries; just look for the file formats you need.
To check if the file extension is right, read the first few bytes of the file and check them against a dictionary of graphics file headers. For example, GIF files should start with GIF, BMP files start with BM, and in a JPEG header you will find JFIF. I think the Unix utility file works this way.
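The header-dictionary approach might look something like the following Python sketch (again not Delphi; the table and function names are hypothetical). Two entries go beyond what is said above and are my own additions: JPEG is matched on its start-of-image marker FF D8 FF rather than on the JFIF string, since not every JPEG carries a JFIF segment, and a PNG signature is included.

```python
import os

# (signature bytes, format name). Only GIF and BMP come from the discussion;
# the JPEG marker and the PNG signature are additions.
SIGNATURES = [
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]

# Extensions accepted for each detected format (illustrative mapping).
EXTENSIONS = {
    "gif": {".gif"},
    "bmp": {".bmp"},
    "jpeg": {".jpg", ".jpeg"},
    "png": {".png"},
}


def sniff_format(header: bytes):
    """Return the format name matching the leading bytes, or None."""
    for signature, name in SIGNATURES:
        if header.startswith(signature):
            return name
    return None


def extension_matches(filename: str, header: bytes) -> bool:
    """True if the file's extension agrees with the format found in header."""
    detected = sniff_format(header)
    extension = os.path.splitext(filename)[1].lower()
    return detected is not None and extension in EXTENSIONS[detected]
```

To use it on a real file, read the first 16 bytes or so in binary mode and pass them in together with the file name. A matching signature only says what the file claims to be, not that the rest of it is intact.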
Since you used the term "requirement", I suspect that you're doing a job for someone, possibly as a contract. So make sure that you nail the requirements before worrying about the code.
IMO, you need to get samples of test cases. As others mentioned, failure to load the file as a particular format will be one test. But what about a .jpg that loads ok, but the bottom third is missing? Or a .jpg that loads ok but has green "static" lines in the middle where an error occurred upstream somewhere (on the camera, photoshop, whatever) but then the processing recovered and resumed? In this case, the .jpg may really have green lines in it. Is that considered "corrupt" or not? This is where you need to be careful, especially if it's a contract job.
I have handled this situation by reading the suspicious image and trying to get its shape. The task is done within a try-except block. The code follows:
import cv2
image = cv2.imread('./image.jpg')  # returns None if the file cannot be read or decoded
try:
    dummy = image.shape  # raises AttributeError when image is None
except AttributeError:
    print("[INFO] Image is not available or corrupted.")
This approach should cover all your needs like:
Detecting a corrupted image
Non-image file with an image-type extension detection
Missing image detection etc.
I am using the open-source pdftotext tool to convert PDF files to text files. How can I save the text files in UTF-8 format so that I retain all the accented characters? I am using the command below, which extracts the content to a text file, but I am not able to see any accented characters.
pdftotext -enc UTF-8 book1.pdf book1.txt
Please help me to resolve this issue.
Thanks in advance,
You can get a list of available encodings using the command:
pdftotext -listenc
and pick the right one using the -enc argument. Mine here seems to do UTF-8 by default, i.e. your "-enc UTF-8" is superfluous:
pdftotext -enc UTF-8 your.pdf
You may want to check your locale (LC_ALL, LANG, ...).
EDIT:
I downloaded the following PDF:
http://www.i18nguy.com/unicode/unicodeexample.pdf
and converted it on a Windows 7 PC (German) with XPDF 3.02PL5 using the command:
pdftotext.exe -enc UTF-8 unicodeexample.pdf
The text file is definitely UTF-8 encoded, as all characters are displayed correctly. What are you using the text file for? If you're displaying it through a web application, your content encoding might simply be wrong, while the text file has been converted as you wanted it to.
Double-check using either a browser (force the encoding in Firefox to ISO-8859-1 and UTF-8) or using a hex editor.
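If no hex editor is at hand, a few lines of Python give a similar check. This is only a sketch with helper names of my own; it tests whether the bytes decode as UTF-8 and prints them in hex. An accented character such as é is two bytes (c3 a9) in UTF-8 but a single byte (e9) in ISO-8859-1.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def hex_view(data: bytes) -> str:
    """Render bytes as space-separated hex pairs, like a hex editor row."""
    return " ".join(f"{byte:02x}" for byte in data)


sample_utf8 = "\u00e9".encode("utf-8")         # e-acute as UTF-8
sample_latin1 = "\u00e9".encode("iso-8859-1")  # e-acute as ISO-8859-1

print(hex_view(sample_utf8), is_valid_utf8(sample_utf8))      # c3 a9 True
print(hex_view(sample_latin1), is_valid_utf8(sample_latin1))  # e9 False
```

For the real output, read the text file in binary mode (for example the book1.txt from the question) and pass its contents to both functions.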
Things are getting a little bit messy, so I'm adding another answer.
I took the PDF apart and my best guess would be a "problem" with the font used:
open the PDF file in Acrobat Reader
select all the text on the page
copy it and paste it into a Unicode-aware text editor (there's no "hidden" OCR, so you're copying actual data)
You'll see that the codepoints you end up with aren't the ones you're seeing in the PDF reader. Whatever the font is, it may have a mapping different from the one defined in the Unicode standard. As such, your content is "wrong" and there's not much you can do about it.
I need to convert an RTF document that contains images (JPGs/PNGs) to an image format (JPG or PNG) programmatically, on the server side (web). Do you have any ideas on how to do it?
Thanks
You can use a virtual printing device, for example: http://www.joyprinter.com/
If by programmatically, you mean scripts, you could script your RTF program to open files, then export to PDF, then export the PDF to an image. At least, this kind of operation is relatively easy on OS X. You could probably do it entirely in Automator, using TextEdit and Preview. Otherwise, on OS X you could also try accessing the core services that would do the same thing. No clue on Windows though. Hope that helps!
You might want to write a bash script to be executed by a cronjob. So at a defined time, or after a defined period, you will have your rtf files converted into jpgs.
Though I don't know if this might satisfy your "programmatic" need .. here is how to do this conversion:
To convert rtf files that contain "advanced" features like images, as in your case, you need unoconv, which requires libreoffice to be installed.
unoconv -f pdf "${input_file}"
Otherwise, just for reference because it's not your case, if the rtf files contain only simple text you can avoid the requirement to have libreoffice installed by using a cascade conversion like
# convert rtf to txt
unrtf --text "input_file.rtf" > "temp.txt"
# convert txt to pdf
enscript "temp.txt" -o - | ps2pdf - "temp.pdf"
# convert pdf to jpg
convert -quality 100 -append "temp.pdf" "output.jpg"
# remove temp files
trash "temp.txt" "temp.pdf" # or rm if you prefer
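If you would rather drive the unoconv route from a server-side script than from bash, a Python wrapper could look roughly like this. It is a sketch under assumptions: unoconv and ImageMagick's convert must be installed and on the PATH, the function names are hypothetical, and it assumes unoconv writes the PDF next to the input file with the same base name, which you should verify on your setup.

```python
import subprocess
from pathlib import Path


def build_commands(rtf_path: str, jpg_path: str) -> list:
    """Build the two command lines: rtf -> pdf (unoconv), pdf -> jpg (convert)."""
    # Assumption: unoconv places the PDF beside the input, same base name.
    pdf_path = str(Path(rtf_path).with_suffix(".pdf"))
    return [
        ["unoconv", "-f", "pdf", rtf_path],
        ["convert", "-quality", "100", "-append", pdf_path, jpg_path],
    ]


def convert_rtf_to_jpg(rtf_path: str, jpg_path: str) -> None:
    """Run both steps in order; raises CalledProcessError if either fails."""
    for command in build_commands(rtf_path, jpg_path):
        subprocess.run(command, check=True)
```

Calling convert_rtf_to_jpg("input_file.rtf", "output.jpg") would run the two steps in sequence. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.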