Remove all metadata from an image on the command line

So I used the exiftool -all= command-line tool to remove the metadata from an image. However, when I print the metadata of the resulting image, I get this:
$ exiftool myimage.jpg
ExifTool Version Number : 11.30
File Name : myimage.jpg
Directory : out
File Size : 2.8 MB
File Modification Date/Time : 2019:05:16 03:34:02-07:00
File Access Date/Time : 2019:05:16 03:34:02-07:00
File Inode Change Date/Time : 2019:05:16 03:34:02-07:00
File Permissions : rw-r--r--
File Type : JPEG
File Type Extension : jpg
MIME Type : image/jpeg
DCT Encode Version : 100
APP14 Flags 0 : [14]
APP14 Flags 1 : (none)
Color Transform : YCbCr
Image Width : 3729
Image Height : 2246
Encoding Process : Baseline DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:4:4 (1 1)
Image Size : 3729x2246
Megapixels : 8.4
I am wondering a few things:
Whether it is required at some level to have all of this (albeit minimal) metadata. That is, I'm wondering if we could get even more minimal and really remove all metadata.
If we can't remove all the metadata that remains, whether I can at least remove the first three attributes (ExifTool Version Number, File Name, and Directory).
If any of this is possible, what tool (preferably a command-line tool) could accomplish it.

Almost all of that remaining data is not metadata embedded in the file. It consists of properties of the image or of the underlying OS, or, in the case of ExifTool Version Number, the version of exiftool you are running. The -g and -G options can show you which group each tag comes from; in particular, group families 0 and 1 can be used to distinguish the tags that come from ExifTool and file attributes from those embedded in the image file itself.
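For example, a quick way to see the groups (output elided here; the exact group names depend on the file):
exiftool -a -G0:1 myimage.jpg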
Items such as the permissions, file name, directory, and timestamps are taken directly from the underlying OS. They are properties of every single file on the drive; without them, the file itself would not exist.
The file/MIME type entries are properties that exiftool derives when it figures out what kind of file it is.
Except for the APP14 entries, the rest is data about the image itself: how it is encoded, the format of the encoding blocks, the size of the image, etc.
The only thing that is embedded in this image is the APP14 block. This block contains no data that can identify the origin of the image. But there is a chance that removing it will significantly alter the colors of the image (see this post). It can be removed by adding -Adobe:All= to the command.
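Putting it together, a single invocation can strip both in one pass (a sketch; -overwrite_original writes in place, omit it to keep the _original backup copy exiftool normally creates):
exiftool -all= -Adobe:All= -overwrite_original myimage.jpg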

Related

EOFError when converting gensim word2vec to binary format

I have pretrained embeddings in word2vec text format. I loaded them and then saved them to .bin, but I cannot load that file back; it fails with EOFError: unexpected end of input; is count incorrect or file otherwise damaged?
My original code is:
from gensim.models import KeyedVectors

# wordfile is the path to the original text-format embeddings
model = KeyedVectors.load_word2vec_format(wordfile)
model.save_word2vec_format("file.bin", binary=True, write_header=True)
bin_model = KeyedVectors.load_word2vec_format("file.bin", binary=True)
And I can load this file.bin with a limit argument: KeyedVectors.load_word2vec_format("file.bin", binary=True, limit=10000).
Is there some other process needed when I save embeddings?
There's a good chance that your .bin file has an incorrect leading count, or the file has otherwise been damaged or truncated, because that error means the file declared in its header (first line) a larger number of word-vectors than were found during the attempted load.
So, if you downloaded it or copied it from somewhere, check the original source to make sure you've got the full file.
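You can also read the header yourself to see what the file claims; a minimal sketch, assuming the converted file is named file.bin as above (the header of a word2vec file is a plain-text "count size" line even in binary format):
with open("file.bin", "rb") as f:
    declared_count, vector_size = map(int, f.readline().split())
print("header declares", declared_count, "vectors of size", vector_size)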
Is there a reason you're performing this conversion? The formats are essentially equivalent and result in the exact same object in Python after loading.
If there's any tiny on-disk size saving in the binary format, you could probably save more by gzipping the file (which .load_word2vec_format() will also happily decompress, if it sees a trailing .gz on the filename).
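For example (a sketch; the file names are illustrative, and KeyedVectors is imported as above):
import gzip, shutil
with open("file.txt", "rb") as src, gzip.open("file.txt.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
model = KeyedVectors.load_word2vec_format("file.txt.gz")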

Extracting binary data with Exiftool

I am doing a steganography challenge. I have a JPEG file, and when I look at its metadata with Exiftool, I see something interesting:
Red Tone Reproduction Curve : (Binary data 64 bytes, use -b option to extract)
Green Tone Reproduction Curve : (Binary data 64 bytes, use -b option to extract)
Blue Tone Reproduction Curve : (Binary data 64 bytes, use -b option to extract)
So I thought that maybe there is something interesting there, but I am not sure how to extract that binary data. When I use the command exiftool -b -RedTRC nameOfThePicture.jpeg, I get this output: curv ck?Q!)2;FQw]kpz|i}0.
I think that output could be something, but I don't know how to extract the binary data or what type of file it could be. Any help?
That "message" is simply the binary data of that tag displayed on the command line. The curv part is the ID value for the TRC (see example code, specifically tagbase.sig = swap((long)0x63757276L); /*’curv’*/).
To extract that data to a file, you just need to redirect the output.
exiftool -b -RedTRC nameOfThePicture.jpeg >RedTRC.dat
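If you then want to look inside it, the ICC curveType layout is simple: a 4-byte "curv" signature, 4 reserved bytes, a big-endian 4-byte entry count, then that many 16-bit values. A minimal Python sketch, assuming RedTRC.dat was produced by the command above:
import struct

with open("RedTRC.dat", "rb") as f:
    data = f.read()

sig = data[0:4]                                 # type signature, should be b"curv"
count = struct.unpack(">I", data[8:12])[0]      # number of curve entries
entries = struct.unpack(f">{count}H", data[12:12 + 2 * count])
print(sig, count, entries[:8])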

Encoding PNG image with specific filter type using Imagemagick

I am just trying to re-encode a PNG file with a specific filtering method using the ImageMagick command-line tool ('convert', to be specific).
As per this page, we can set the quality parameter using two decimal digits: the first digit is the zlib compression level when 1-9 is used, and the second digit indicates the filter method to apply before compression.
But with the above option, filter type '0' (i.e. no filter) is always used when encoding, even when I specify a filter type other than '0'.
E.g. with the command below
convert -quality 11 in.png out.png
I should get an output image with zlib compression level 1 (which I am not bothered about) and filter type '1' (the 'sub' type of filtering).
But the output image always has filter type '0', even if I change the quality level to 11, 12, ... 15, etc. (corresponding to filters '1', '2', ... '5').
I also tried the controls described on this page, but the results are the same.
Help is appreciated!
The "filter method" in the IHDR chunk of any PNG file is always 0.
What ImageMagick/GraphicsMagick's "quality" digit controls is the
filter type used in each scanline. So "-quality 11" will produce
a PNG file that has filter_method=0 in IHDR, and "filter_type=1" as
the first byte of each row.
If you have "pngtest" that comes with libpng, you can find out which
filter types are used:
pngtest -m out.png
yields for example
PASS (3396 zero samples)
Filter 1 was used 24 times
Filter 2 was used 478 times
Filter 3 was used 2 times
Filter 4 was used 241 times
libpng passes test
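If you don't have pngtest handy, a minimal Python sketch can count the per-row filter-type bytes directly (assumes a non-interlaced PNG; interlaced images split rows across passes):
from collections import Counter
import struct, zlib

def png_filter_histogram(path):
    # Returns a Counter mapping filter type (0-4) to the number of rows using it.
    with open(path, "rb") as f:
        assert f.read(8) == b"\x89PNG\r\n\x1a\n", "not a PNG file"
        idat = b""
        while True:
            length, ctype = struct.unpack(">I4s", f.read(8))
            data = f.read(length)
            f.read(4)  # skip the chunk CRC
            if ctype == b"IHDR":
                width, height, depth, color, _, _, interlace = struct.unpack(">IIBBBBB", data)
                assert interlace == 0, "interlaced PNGs are out of scope here"
                channels = {0: 1, 2: 3, 3: 1, 4: 2, 6: 4}[color]
                rowbytes = (width * channels * depth + 7) // 8
            elif ctype == b"IDAT":
                idat += data
            elif ctype == b"IEND":
                break
    raw = zlib.decompress(idat)
    stride = rowbytes + 1  # one filter-type byte precedes each row
    return Counter(raw[i] for i in range(0, height * stride, stride))

print(png_filter_histogram("out.png"))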
EDIT: There does seem to have been a fleeting problem with ImageMagick. Under version 7.0.2-9, "-quality 11" produces an image with all rows having filter_type 0, as you noted. "-quality 15" does work properly, with all four filter types being present. GraphicsMagick, ImageMagick-7.0.3-0, and ImageMagick-6.9.5-9 all work properly.

Convert PDF files to PDF/A via Ghostscript

I'd like to convert arbitrary PDF files to PDF/A with Ghostscript 9.15.
Is Ghostscript able to create PDF/A-3b conformant PDFs? There is no parameter that represents a PDF/A conformance level, so I assume it is not possible. Or have I overlooked something?
I was following a blog post where a Windows batch file is used to convert from PDF to PDF/A (see http://www.mcbsys.com/techblog/2013/04/batch-convert-pdf-to-pdfa/). The gs invocation in the batch is:
"%gs_path%\gswin64c" ^
-dPDFA ^
-dNOOUTERSAVE ^
-sProcessColorModel=DeviceRGB ^
-sDEVICE=pdfwrite ^
-o "GS_%file1%" ^
-dPDFACompatibilityPolicy=1 ^
"%currentdir%\PDFA_def.ps" ^
%inputfilelist%
The PDFA_def.ps is an adjusted version of the official one:
%!
% This prefix file for creating a PDF/A document is derived from
% the sample included with Ghostscript 9.07, released under the
% GNU Affero General Public License.
% Modified 4/15/2013 by MCB Systems.
% Feel free to modify entries marked with "Customize".
% This assumes an ICC profile to reside in the file (AdobeRGB1998.icc),
% unless the user modifies the corresponding line below.
% The color space described by the ICC profile must correspond to the
% ProcessColorModel specified when using this prefix file (GRAY with
% DeviceGray, RGB with DeviceRGB, and CMYK with DeviceCMYK).
% Define entries in the document Info dictionary :
/ICCProfile (... PATH TO ... AdobeRGB1998.icc) % Customize.
def
[ /Title (Title) % Customize.
/DOCINFO pdfmark
% Define an ICC profile :
[/_objdef {icc_PDFA} /type /stream /OBJ pdfmark
[{icc_PDFA} <</N systemdict /ProcessColorModel get /DeviceGray eq {1} {systemdict /ProcessColorModel get /DeviceRGB eq {3} {4} ifelse} ifelse >> /PUT pdfmark
[{icc_PDFA} ICCProfile (r) file /PUT pdfmark
% Define the output intent dictionary :
[/_objdef {OutputIntent_PDFA} /type /dict /OBJ pdfmark
[{OutputIntent_PDFA} <<
/Type /OutputIntent % Must be so (the standard requires).
/S /GTS_PDFA1 % Must be so (the standard requires).
/DestOutputProfile {icc_PDFA} % Must be so (see above).
/OutputConditionIdentifier (AdobeRGB1998) % Customize
>> /PUT pdfmark
[{Catalog} <</OutputIntents [ {OutputIntent_PDFA} ]>> /PUT pdfmark
So, I use AdobeRGB1998.icc, which is obviously usable for PDF files with an RGB color space. Depending on the -sProcessColorModel value (DeviceRGB), a correct value is printed out.
The conversion works for all files. But when I validate the created PDF file against PDF/A-1b, I get different results depending on whether the input file has an RGB color space or not (e.g. CMYK). So, when I have an input PDF file which uses a CMYK color space, the file gets converted by the script, but the validator says something like this:
input.pdf", 1, 38, 0x03418614, "A device-specific color space (DeviceCMYK) without an appropriate output intent is used.", 1
"output.pdf", 20, 0, 0x83410612, "The document does not conform to the requested standard.", 1
My question: Is there a way to get the conversion done for arbitrary files (i.e. independent of the used color space in the input file)?
Update
@KenS Thanks for your answer. I've updated my initial post to clarify what I want to achieve.
To make it more explicit, I will use an example. There are two files: input1.pdf (seems to use RGB) and input2.pdf (seems to use CMYK). I want to convert both of them to PDF/A-1. Thanks to your hint, I've let go of the above-mentioned batch script and instead tested the command directly on the command line. After reading Ps2pdf.htm#PDFA, I adjusted the (official) PDFA_def.ps so that AdobeRGB1998.icc is used. Then I invoked the following command on both input files (replacing output1.pdf with output2.pdf and input1.pdf with input2.pdf for the second file):
gswin64c.exe -dPDFA=1 -dBATCH -dNOPAUSE -dNOOUTERSAVE \
-sColorConversionStrategy=/RGB \
-sOutputICCProfile=AdobeRGB1998.icc -sDEVICE=pdfwrite \
-sOutputFile=output1.pdf -dPDFACompatibilityPolicy=1 \
"PATH/TO/OFFICIAL/PDFA_def.ps" input1.pdf
The conversion was done without any errors. The output1.pdf seems to be valid, but the output2.pdf is still invalid (tested with 3heights Validator):
"output2.pdf", 1, 40, 0x03418614, "A device-specific color space (DeviceCMYK) without an appropriate output intent is used.", 1
"output2.pdf", 20, 0, 0x83410612, "The document does not conform to the requested standard.", 1
So if I understand your answer correctly, the above command should produce a PDF file which uses the RGB color space, independent of the color space of the input file. If the input file uses CMYK, then the colors have to be translated to RGB by the above command.
If I interpret the first error message correctly, the color space used in output2.pdf is still CMYK (although the command specifies parameters like ColorConversionStrategy=/RGB). Since I used AdobeRGB1998.icc, the validation error appears.
What am I missing in the above command?
Going back to my original question (which is one step further): Instead of always converting to RGB (or CMYK), I wanted to somehow detect which color space is used in the input file and then dynamically switch to an RGB or CMYK ICC profile. Is it possible to achieve that?
Ghostscript does not support PDF/A-3. The conformance parameter you are looking for is -dPDFA=, where valid values are nothing (defaults to 1), 1, or 2. You can find this documented in ghostpdl/gs/doc/Ps2pdf.htm#PDFA.
I'm not sure what you are asking for here, though. You must create a PDF/A file (in level 1 or 2 anyway, I haven't read the revision 3 spec yet) which is either RGB or CMYK, because you aren't allowed to use both (you can convert everything to device-independent colour, of course). The colour space used in the input isn't relevant, other than to decide whether it needs to be converted.
This is something you need to decide; we can't decide it for you. One important reason is that the OutputIntent must be consistent with either RGB or CMYK, and the pdfwrite device doesn't check it; it assumes you chose one which matches the device space you are using for the PDF file. (By the way, don't set ProcessColorModel; use ColorConversionStrategy instead.) In your case you have set the OutputIntent to AdobeRGB1998, so your colours must be specified either in device-independent colour or in RGB.
Given the errors you quote, I would suggest the problem is that you haven't specified -sColorConversionStrategy, so the input colours are not being converted to the required device space. I would further guess that the script you copied this from set -dUseCIEColor, and you didn't copy that bit. DO NOT set -dUseCIEColor; it's a horrible ancient piece of PostScript hackery. Instead set ColorConversionStrategy, which will convert colours in a much better way, as required.
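For a CMYK target the command has the same shape; a sketch only, assuming you have a CMYK ICC profile available (the profile name here is a placeholder) and a PDFA_def.ps that embeds that same profile:
gswin64c.exe -dPDFA=1 -dBATCH -dNOPAUSE -dNOOUTERSAVE \
-sColorConversionStrategy=CMYK \
-sOutputICCProfile=your_cmyk_profile.icc -sDEVICE=pdfwrite \
-sOutputFile=output2.pdf -dPDFACompatibilityPolicy=1 \
"PATH/TO/CMYK/PDFA_def.ps" input2.pdf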
Updated answer as this started getting too long for a comment:
I can't immediately see any problems with your command line; can you share an example PDF file? It's much easier to investigate these things with a solid example. I know from our customers and other free users that pdfwrite is capable of producing conforming PDF/A-1b files.
Regarding the second question: it's not possible to do that, because currently you need to set the OutputIntentProfile to either a CMYK one or an RGB one before you start. You can't just run through the input PDF file until you come to a colour operation and then decide. If you feel like some programming, it could be done by modifying pdfwrite, because the profile isn't actually used till the output is closed.
One problem is that, in order to do the colour conversion, you need to set the underlying ProcessColorModel (this is done for you automatically by ColorConversionStrategy). The only way to change ProcessColorModel is to execute a setpagedevice, which causes an erasepage. Now I think that's actually fixable with pdfwrite; all it does is write a white rectangle over the page, so you should be able to intercept that and not emit it. Otherwise any marks you made before you encountered an RGB or CMYK operation would be underneath the white rectangle.
So essentially no, you can't do it right now. If it's important to you, then you could probably modify the code to do so (don't forget you will also need to supply two OutputIntent profiles to choose between as well). We've never had a customer request to do this, so we won't likely take it on as a project. Of course, if you did get this working, we might very well incorporate it into the code base if you were to offer it back to us.
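If you do want to pick the profile from outside Ghostscript, a deliberately crude pre-scan is one option. This Python sketch just looks for the DeviceCMYK name in the raw bytes to decide which profile and PDFA_def.ps to pass on the command line; a robust version would parse the PDF properly, since colour operators inside compressed content streams are invisible to a byte scan:
def guess_target(path):
    raw = open(path, "rb").read()
    # /DeviceCMYK in page resources is a strong hint; compressed content
    # streams can still hide CMYK usage from this check.
    return "CMYK" if b"/DeviceCMYK" in raw else "RGB"

profiles = {"RGB": "AdobeRGB1998.icc", "CMYK": "your_cmyk_profile.icc"}
print(profiles[guess_target("input2.pdf")])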

Reading MP3 audio data, or calculating the checksum thereof

Is there a Ruby library that will allow me to either calculate the checksum of an MP3 file's audio data (minus the metadata) or allow me to read in an MP3's audio data to calculate the checksum myself?
I'm looking for something like this:
mp3 = Mp3Lib::MP3.new('/path/to/song.mp3')
mp3.audio.sha1sum # => the sha1 checksum of _only_ the audio, minus the metadata
I found Mp3Info, but it seems a bit tedious. When initializing an Mp3Info object, you can get the frames where the actual audio data begins and ends.
Extracting the MP3 file's audio without its metadata is fairly easy to do yourself.
ID3v1
The metadata is the last 128 bytes of the file and, if it exists, always begins with the 3 bytes "TAG". Just ignore these last 128 bytes.
ID3v2
The metadata can be stored at the beginning or the end of the file. Most implementations only support the beginning. ID3v2 has a header in which the size is stored. The header is always located at the beginning of the metadata. There is an optional footer, which is a copy of the header at the end of the metadata. If the metadata is at the end of the file, the footer is required.
The header has the following form:
ID3v2/file identifier "ID3"
ID3v2 version $04 00
ID3v2 flags %abcd0000
ID3v2 size 4 * %0xxxxxxx
The footer has the following form:
ID3v2/file identifier "3DI"
ID3v2 version $04 00
ID3v2 flags %abcd0000
ID3v2 size 4 * %0xxxxxxx
The d bit says whether the footer is present. The size is measured without the header and footer. Every byte of the size always has its highest bit cleared, so only 28 of the actual 32 bits represent the size.
Just compute which part of the file is not metadata, and use that for your hashing.
Be aware that if both ID3v1 and ID3v2 are located at the end of the file, ID3v1 is located behind ID3v2.
The spec can be found at http://www.id3.org/id3v2.4.0-structure.
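A minimal Python sketch of that logic (the question asks for Ruby, but the byte layout is language-agnostic; this handles a leading ID3v2 tag and a trailing ID3v1 tag, and ignores the rarer appended-ID3v2 case):
import hashlib

def syncsafe(b):
    # 4-byte syncsafe integer: 7 payload bits per byte, high bit always 0.
    return (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]

def audio_sha1(path):
    data = open(path, "rb").read()
    start, end = 0, len(data)
    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if data[end - 128:end - 125] == b"TAG":
        end -= 128
    # ID3v2 at the start: 10-byte header "ID3" + version + flags + syncsafe size.
    if data[:3] == b"ID3":
        footer = 10 if data[5] & 0x10 else 0  # the 'd' flag bit from above
        start = 10 + syncsafe(data[6:10]) + footer
    return hashlib.sha1(data[start:end]).hexdigest()

print(audio_sha1("/path/to/song.mp3"))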
Isn't the ID3 tag stored either at the end of the file (ID3v1) in a 128-byte block, or in a block at the start of the file (ID3v2.3 and v2.4)? (id3.org)
You could use the audio_content method from Mp3Info and read that much data from the file, though it's probably not much more complicated to have a look in the file yourself and work out where the headers aren't.
