How to determine if a photo is corrupted? - image

I have a requirement where in I have to determine whether a photo is corrupted and accordingly tag it as such.
Another thing, I need is to determine if an Image has got wrong extension. What I mean by wrong extension is that sometimes I have come across a photo that has extension of jpg but when I load this photo into IrfanView it reports that the photo is in different format that the extension.
How can I do this in Delphi.

I have a requirement where in I have to determine whether a photo is corrupted and accordingly tag it as such.
You can try some things, but with certain file formats (example: BMP, JPEG to some extent) only a human can ultimately decide if the file is OK or corrupted. The simplest test is to simply load the file into a corresponding object (TJpegImage, TPngObject, etc). If you get an exception while loading you've surely got a corrupted file. Unfortunately if no exception is raised you can't really say the file is not corrupted. I've seen corrupted JPEG files that load just fine into a Delphi TImage and can be opened with Windows's Image Viewer, but are obviously corrupted to a human observer. With BMP images it's even clearer: open up a bitmap, overwrite some bytes in the middle of the file and then open it in a viewer. How can any automated system tell those wrongly colored bits in the middle of the bitmap are actually wrong?
Another thing, I need is to determine if an Image has got wrong extension. What I mean by wrong extension is that sometimes I have come across a photo that has extension of jpg but when I load this photo into IrfanView it reports that the photo is in different format that the extension.
How about doing some of the same, trying to load the file into the object that corresponds to it's extension, and if you fail, try opening up with some other formats? This should be easy.
Alternatively you can investigate image headers: Most file formats start with a short signature, a few bytes. You can look up the documentation of all image file formats and find the signature, or you can simply open up an large number of files and look for a pattern in the first 4 bytes. I'd go for this second alternative since finding proper documentation for all image file formats might be a challenge.

The only way to check if file is corrupted is to try reading it as it is described in file format, ie. load BMP as BMP with reading BMP header, BMP data etc. There are many web pages that describe graphics file formats. Of course if you transmit files and are afraid that it will be corrupted after transmitting then save such files with some sum like CRC32, or even cryptographic MD5 or SHA1. Then after transmitting check if calculated sum is the same as original.
In Delphi there is unit jpeg and types TJPEGImage and TBitmap. Try loading it with data and check exception. For others formats there are many libraries, just look for required file formats.
To check if file extension is good try reading some first bytes of file and check it with some dictionary of graphics file headers. For example GIF files should start with GIF, BMP files starts with BM, and in JPEG header you will find JFIF. I think unix utility file works this way.

Since you used the term "requirement", I suspect that you're doing a job for someone, possibly as a contract. So make sure that you nail the requirements before worrying about the code.
IMO, you need to get samples of test cases. As others mentioned, failure to load the file as a particular format will be one test. But what about a .jpg that loads ok, but the bottom third is missing? Or a .jpg that loads ok but has green "static" lines in the middle where an error occurred upstream somewhere (on the camera, photoshop, whatever) but then the processing recovered and resumed? In this case, the .jpg may really have green lines in it. Is that considered "corrupt" or not? This is where you need to be careful, especially if it's a contract job.

I have handled this situation by reading the suspicious image and trying to getting its shape. The task is done within try-except block. Following is the code:
import cv2
image = cv2.imread('./image.jpg')
try:
dummy = image.shape # this line will throw the exception
except:
print("[INFO] Image is not available or corrupted.")
This approach should cover all your needs like:
Detecting a corrupted image
Non-image file with an image-type extension detection
Missing image detection etc.

Related

Does jpeg compression destroys any embedded malicious code inside an image?

I'm trying to figure out what is the best way to "clean" images that are coming from non authorized source (app visitors) before opening them, similar to Whatsapp.
Scanning each image with anti virus is probably not so efficient in a large scale, So i came to assumption that rewriting each incoming image by compressing it using jpeg could results a clean image without a malicous code inside it.
From what i read so far the JPEG compression should destroy any hidden content and reorder the data structure of the image which will results a safe image.
WTYT? Am i on the right path to overcome this issue?
There is no code in a JPEG stream. In fact, I don't know of any image format that directs the decoder execute code.
The worst think I can thing of would be to have a JPEG stream that, say embedded malicious code in a thumbnail, COM, or APPn marker. Then another application would look for that image and load the code.
Even this requires something else to get on your system to execute the JPEG "code" and it would be a lot of trouble for something that could be accomplished much easier.

Non-deterministic* data in header/beginning of PNG files

I noticed that PNG files created by Gimp from the same RPG data are identical except for the very beginning. This image shows a diff of otherwise identical PNG files created with Gimp:
What is this data which changes each time and how is it encoded? Are there tools to decode it? Can you learn something from this information, e.g. can you find out when a PNG file was (probably) created by this information?
I was under the impression that PNG files are created deterministically* and don't store meta data which isn't necessary to decode the image. (Obviously, the last part is not true, either, as Gimp writes its own name into the files but doesn't ask the user (which is does if you export something as a JPEG file).)
 * I use the word "deterministic" here to refer to things and only such which are the same on each execution/export/whatever given the same input. I'd usually use the word "functional" (i.e. like a mathematical function) but I fear this could be misunderstood by people who don't know what "functional" means in mathematics. Obviously, this is different from the usage of this word in information theory.
See the PNG header definition.
tIME stores the time that the image was last changed, so for me it's the same as the timestamp of the file you create.
bKGD gives the default background color. Possibly the bakcgournd color you are using in Gimp, or the color of the transparent pixels.
tEXT with key Comment and value Created with Gimp is just the default comment. You can change the comment for the image in Image>Properties and you can set a default comment in Edit>Preferences>Default Image
When I export the same PNG twice, I only see a change in tIME. In fact I can't get a bKGD item, even when exporting a PNG with transparent pixels. Are you using any specific options when exporting?

images with invalid jpeg marker

I encountered a website where the images consistently throws 'invalid jpeg marker' error when downloaded. I am wondering is it possible that they are intentionally doing something which causes this error for most of the users who try to download and use their images?
I want to protect the jpeg resources of my website from unauthorised use. Is it possible to really change something in jpeg header or meta tags so that jpg images display fine on browser but if someone downloads it for their own use it throws an error 'invalid jpeg marker'?
(I don't intend to discuss alternative ways of protecting images online or the limitations of it.)
If it can display on a browser, it can display on something else. What is likely happening is that the decoder you are using to view when downloaded is more strict than your web browser. I presume you are not using your browser to view them after downloading.
You could run the "file" command to make sure the images are actually JPEGs. If so, there are a number of programs available to analyze the structure of a JPEG stream. You could use one to see what's going on with the image. You may have some odd marker ordering or possibly the JPEG image does not occur right at the start of the file stream. I've seen some weird JPEGs where there were extra bytes after the APPn marker and before the next JPEG marker. Some decoders ignored the extra bytes others puked.

Decode JPEG image stripped from inside a PDFs file

I have code that decompresses jpgs into bit maps which works fine for JPEG files, however when I feed the code a JPEG I have stripped directly from a PDFs XObject I get errors.
Adobe reader displays the image fine so I don't believe it's corrupted. I have read through JPEG and PDFs documentation and don't find any obvious problems.
My question is this, is there anything different in the "JPEG" embedded inside a PDFs stream and a normal JPEG? And if so what is it?
Note: I can manually open the PDFs, copy the image, paste into paint, and save...when I do this everything works....my problem is I need this automated.
When my code parses the PDFs, strips out the image stream, dumps the binary to a file, and then I try and open this file, it does not work. What am I missing?
My errors seem to be occurring in the Huffman decoding process, the cdt and Huffman tables appear to be read in fine.
Pardon my using the answer section but I overflowed the comment section:
My questions:
1. What code is failing to decode the JPEG? You say you "have code" but where did that come from? Why do you think that it is reliable?
What is the file format of the JPEG stream? JFIF, ADOBE, EXIF, none specified?
Could there be something in the file format that your decoder cannot handle? Does your encoder check for different types of APPn markers?
What is the JPEG format? What type of SOS marker?
Does this encoder source handle all the normally formats? Baseline, Extended, Sequential, progressive? If you have progressive JPEG and and encoder that only does baseline, you are going to have a problem.
How many components does the JPEG stream have?
Some Adobe files have 4 components and decoders may only be able to handle 1 or 3.

How can I store raw data in an image file?

I have some raw data in a file that I would like to store in an image file (bmp, jpg, png, or even gif (eegad)). I would like this to be a two way process: I need to be able to reliably convert the image file back later and get a file that is identical to the original file.
I am not looking for a how-to on steganography; the image file will probably be one pixel wide and millions of pixels high and look like garbage. That is fine.
I looked into the Imagemagick utility convert, but am intimidated by the large number of options and terse man page. I am guessing I could just use this to convert from a 'raw' black channel to png, but would have to specify a bunch of other stuff. Any hints? I would prefer to work within Imagemagick or using Linux utilities.
If you are wondering, there's nothing black hat or cloak and dagger about my request. I simply want to automatically backup some important data to a photo-sharing site.
I'd plow into ImageMagick if that's what you'd prefer anyway.
Specific image formats support storing text data to different degrees, and ImageMagick supports all of the formats you mentioned. I'd choose the one that lets you store what you need.

Resources