Splitting a binary file on binary delimiter?


I'm working on a shell script to convert MPO stereographic 3D images into standard JPEG images. An MPO file is just two JPEG images, concatenated together.
As such, you can split out the JPEG files by finding the byte offset of the second JPEG's magic number header (0xFFD8FFE1). I've done this manually using hexdump/xxd, grep, head, and tail.
The problem here is grep: what can I use to search a binary directly for a specific magic number, and get back a byte offset? Or should I not use a shell script for this at all? Thanks.

You can do this using bbe (http://bbe-.sourceforge.net/) which is a sed like program for binary files:
In order to extract the first JPEG use:
bbe -b '/\xFF\xD8\xFF\xE1/:' -e 'D 2' -o first_jpeg mpo_file
And for the second one:
bbe -b '/\xFF\xD8\xFF\xE1/:' -e 'D 1' -o second_jpeg mpo_file
Note that this will not work if the JPEG's magic number occurs somewhere else in the MPO file.

I think that Bart is on to your biggest problem. If that binary sequence occurs again inside the image data, you will get partial JPEGs.
I did a quick test by concatenating some JPEGs and then extracting them with awk (please note that the magic number in my files ended in 0xE0 and not 0xE1):
# for i in *.jpg ; do cat $i ; done > test.mpo
# awk 'BEGIN {RS="\xFF\xD8\xFF\xE0"; FILENUM=-1} {FILENUM++; if (FILENUM == 0) {next}; FILENAME="image0"FILENUM".jpg"; printf "%s",RS$0 > FILENAME;}' test.mpo
# file image0*.jpg
image01.jpg: JPEG image data, JFIF standard 1.01
image010.jpg: JPEG image data, JFIF standard 1.01
image011.jpg: JPEG image data, JFIF standard 1.01
This seemed to work ok for me, but the above mentioned issues are still unhandled and very real.

I've found a much better explanation of MPO file structure (and how to process it correctly) at http://www.davidglover.org/2010/09/using-the-fuji-finepix-real-3d-w3-camera-on-a-mac-or-linuxunix.html
Edit, October 2019:
Since the blog entry now 404s, here is the script that I wrote based on it. I haven't used it in many years.
#!/usr/bin/env bash
# Script to convert 3D MPO files, as used in the Fuji FinePix series of 3D cameras, into standard JPEG files.
# Based on work by David Glover, posted at http://www.davidglover.org/2010/09/using-the-fuji-finepix-real-3d-w3-camera-on-a-mac-or-linuxunix.html
# This script requires exiftool and ImageMagick.
FULLNAME="$1"
FILENAME="$(basename "$FULLNAME")"
DIRNAME="$(dirname "$FULLNAME")"
BASENAME="${FILENAME%.*}"
# Create output directories
mkdir -p "$DIRNAME"/stereoscopic-rl/
mkdir -p "$DIRNAME"/stereoscopic-mpo/
mkdir -p "$DIRNAME"/stereoscopic-anaglyph/
mkdir -p "$DIRNAME"/monoscopic-l/
mkdir -p "$DIRNAME"/monoscopic-r/
# Create separate left and right images
exiftool -trailer:all= "$FULLNAME" -o "$DIRNAME"/monoscopic-l/"$BASENAME"-left.jpg
exiftool "$FULLNAME" -mpimage2 -b > "$DIRNAME"/monoscopic-r/"$BASENAME"-right.jpg
# Move the MPO file to its new home
mv "$FULLNAME" "$DIRNAME"/stereoscopic-mpo/
# Determine parallax value and create cropped images for stereo generation
# 36 is only appropriate for 4:3 or 3:2 images
parallax=$(exiftool -b -Parallax "$DIRNAME"/monoscopic-r/"$BASENAME"-right.jpg)
parallax=$(echo "$parallax"*36+0.5 | bc | cut -d . -f 1)
# The above pipeline can't deal with a parallax of zero
# In theory, this fix doesn't cover values between zero and -1
# TODO improve the calculation
if [ -z "$parallax" ]; then
parallax=0
fi
echo $parallax
if [ $parallax -ge 0 ]; then
convert "$DIRNAME"/monoscopic-l/"$BASENAME"-left.jpg -crop +"$parallax"+0 "$DIRNAME"/monoscopic-l/"$BASENAME"-left-cropped.jpg
convert "$DIRNAME"/monoscopic-r/"$BASENAME"-right.jpg -crop -"$parallax"+0 "$DIRNAME"/monoscopic-r/"$BASENAME"-right-cropped.jpg
else
convert "$DIRNAME"/monoscopic-l/"$BASENAME"-left.jpg -crop -"$((-1*$parallax))"+0 "$DIRNAME"/monoscopic-l/"$BASENAME"-left-cropped.jpg
convert "$DIRNAME"/monoscopic-r/"$BASENAME"-right.jpg -crop +"$((-1*$parallax))"+0 "$DIRNAME"/monoscopic-r/"$BASENAME"-right-cropped.jpg
fi
# Create stereoscopic images for cross-eye (right-left) and anaglyph (red-cyan) viewing
convert "$DIRNAME"/monoscopic-r/"$BASENAME"-right-cropped.jpg "$DIRNAME"/monoscopic-l/"$BASENAME"-left-cropped.jpg +append "$DIRNAME"/stereoscopic-rl/"$BASENAME"-stereoscopic-rl.jpg
composite -stereo 0 "$DIRNAME"/monoscopic-r/"$BASENAME"-right-cropped.jpg "$DIRNAME"/monoscopic-l/"$BASENAME"-left-cropped.jpg "$DIRNAME"/stereoscopic-anaglyph/"$BASENAME"-stereoscopic-anaglyph.jpg
# Clean up separated parallax-corrected images
rm "$DIRNAME"/monoscopic-l/"$BASENAME"-left-cropped.jpg
rm "$DIRNAME"/monoscopic-r/"$BASENAME"-right-cropped.jpg
exit 0
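The TODO about rounding in the script above could be handled without the bc and cut pipeline. The following is only a sketch, assuming awk is available; the function name round_parallax is hypothetical, and the factor 36 is taken from the script. Unlike the original pipeline, it rounds negative values symmetrically and returns 0 for an empty value or for values between -1 and 0 that round to zero.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scale a parallax value by 36 and round it to the
# nearest integer, rounding halves away from zero.
# An empty argument is treated as 0.
round_parallax() {
    awk -v p="$1" 'BEGIN {
        x = p * 36
        if (x < 0) r = -int(-x + 0.5)
        else       r = int(x + 0.5)
        printf "%d\n", r + 0   # "+ 0" turns a possible negative zero into 0
    }'
}
```

It could replace the two parallax assignments with something like parallax=$(round_parallax "$(exiftool -b -Parallax "$file")"), where $file stands for the right-hand image.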

I think a very simple home brew approach will be your best bet. The code for doing this would be very small, depending on all the special cases of your binary file format.
Use mmap to get a convenient view of your file in memory.
Start scanning, and save the byte-offset in a variable, say start.
Scan until you reach your delimiter, saving the ending offset, in say end.
Create a new file
Memory-map the new file
Copy the byte-range from start to end into the new file.
Close the new file and start scanning again.
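The same scan-and-copy idea can be sketched in shell. This is not mmap: it swaps in grep for the scan and tail/head for the copy. It assumes GNU grep built with PCRE support (-P), uses the marker bytes from the question, and the function names are hypothetical. It inherits the same weakness as any delimiter search: every occurrence of the byte sequence is reported, including ones inside embedded thumbnails.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; assume GNU grep with PCRE support (-P).

# Print the byte offset of each occurrence of FF D8 FF E1, one per line.
# LC_ALL=C makes grep treat the input as raw bytes.
marker_offsets() {
    LC_ALL=C grep -obUaP '\xFF\xD8\xFF\xE1' "$1" | LC_ALL=C cut -d: -f1
}

# Copy bytes [start, end) of file $1 into file $4 (offsets are 0-based).
copy_range() {
    local src=$1 start=$2 end=$3 dst=$4
    tail -c +"$((start + 1))" "$src" | head -c "$((end - start))" > "$dst"
}
```

If marker_offsets prints 0 and N for a file of S bytes, then copy_range file.mpo 0 N first.jpg and copy_range file.mpo N S second.jpg would produce the two halves.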

FFE1 is not part of a JPEG "magic number"; it's the APP1 marker, and it's not guaranteed to come right after the SOI marker FFD8. Also be aware that some JPEG images embed a thumbnail JPEG in an EXIF block, and that thumbnail will most likely contain an APP1 marker as well.

Related

gdal_translate only translating first three bands from .vrt to .tif

I'm trying to translate a set of .tif images to one multiband .tif following this tutorial. I'm not getting any errors but every time I run the script it generates a three-band image when I have a lot more input .tifs than that. I have tried the -r flag and it didn't change anything.
I have:
wdir="/Users/<mydir>"
creoleLA="$wdir/data/s2l2a_tiffs/T15RVP"
dhaka="$wdir/data/s2l2a_tiffs/T45QZG"
easthouston="$wdir/data/s2l2a_tiffs/T15RTN"
kolkata="$wdir/data/s2l2a_tiffs/T45QXF"
murcia="$wdir/data/s2l2a_tiffs/T30SXH"
wuhan="$wdir/data/s2l2a_tiffs/T50RKU"
#=============================================
# Main script
#=============================================
read -p 'ROI: ' roi
indir=${!roi}
outdir="$wdir/data/s2l2a_tiffs"
ls ${indir}*"_B"*.tif > "$outdir/btif_${roi}.txt"
gdalbuildvrt -separate -overwrite -input_file_list "$outdir/btif_${roi}.txt" "$outdir/S2L2A_${roi}.vrt"
gdal_translate -strict -ot uint16 "$outdir/S2L2A_${roi}.vrt" "$outdir/S2L2A_${roi}_mb.tif"
rm "$outdir/S2L2A_${roi}.vrt"
rm "$outdir/btif_${roi}.txt"
Where "$outdir/btif_${roi}.txt" is a text file of the GeoTIFF file paths like this:
/Users/<mydir>/data/s2l2a_tiffs/T15RVP_20200831T164849_B01_60m.tif
/Users/<mydir>/data/s2l2a_tiffs/T15RVP_20200831T164849_B02_60m.tif
...
I am processing Sentinel-2 imagery and using OSX 11.6.

ImageMagick: guess raw image height

I'm using the convert utility from ImageMagick to convert raw image bytes to a usable image format such as PNG. My raw files are generated by code, so there are no headers, just pure pixels.
In order to convert my image I'm using command:
$ convert -depth 1 -size 576x391 -identify gray:image.raw image.png
gray:image.raw=>image.raw GRAY 576x391 576x391+0+0 1-bit Gray 28152B 0.010u 0:00.009
The width is fixed and known to me. However, I have to work out the height of the image from the file size each time, which is annoying.
If no height or the wrong height is specified, the utility complains:
$ convert -depth 1 -size 576 -identify gray:image.raw image.png
convert-im6.q16: must specify image size `image.raw' # error/gray.c/ReadGRAYImage/143.
convert-im6.q16: no images defined `image.png' # error/convert.c/ConvertImageCommand/3258.
$ convert -depth 1 -size 576x390 -identify gray:image.raw image.png
convert-im6.q16: unexpected end-of-file `image.raw': No such file or directory # error/gray.c/ReadGRAYImage/237.
convert-im6.q16: no images defined `image.png' # error/convert.c/ConvertImageCommand/3258.
So I wonder is there a way to automatically detect the image height based on the file/blob size?

A couple of ideas...
You may not be aware of the NetPBM format, but it is very simple and you may be able to change your software that creates the raw images so that it directly generates PBM format images which are readable and useable by OpenCV, Photoshop, GIMP, feh, eog and ImageMagick of course. It would not require any libraries or extra dependencies in your software, all you need to do is put a textual PBM header on the front, so your file looks like this:
P4
576 391
... YOUR EXISTING BINARY DATA ...
Do not forget to put newlines (i.e. linefeed character) after P4 and after 391.
You can try it for yourself and add a header onto one of your files like this and then view it with GIMP or other tool:
printf "P4\n576 391\n" > image.pbm
cat image.raw >> image.pbm
If you prefer a one-liner, just use a bash command grouping like this - which is equivalent to the 2 lines above:
{ printf "P4\n576 391\n"; cat image.raw; } > image.pbm
Be careful to have all the spaces and semi-colons exactly as I have them!
Another idea, just putting some meat on Fred's answer, is the following one-liner, which uses a bash arithmetic context and a bash command substitution:
convert -depth 1 -size "576x$(($(stat -c "%s" image.raw)*8/576))" gray:image.raw image.png
Note that if you are on macOS, stat is a little different, so you may prefer the slightly less efficient, but more portable:
convert -depth 1 -size "576x$(($(wc -c < image.raw)*8/576))" gray:image.raw image.png

You have to know the -depth and width to compute the height for ImageMagick raw format. If depth is 1, then your image is binary (b/w). So height = 8 * file size (in bytes) / width. Here, 28152*8/576 = 391.
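A small sketch of that formula as a reusable function, generalized to other bit depths. The function name raw_height is hypothetical, and it assumes rows are packed with no padding, so that width * depth * height equals the file size in bits. It refuses to guess when the file size is not a whole number of rows, which usually means the width or depth is wrong.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compute the height of a headerless raw image.
# Usage: raw_height FILE WIDTH [DEPTH_IN_BITS]   (depth defaults to 1)
# Assumption: rows are stored back to back with no padding.
raw_height() {
    local file=$1 width=$2 depth=${3:-1}
    local bits=$(( $(wc -c < "$file") * 8 ))   # file size in bits
    local row_bits=$(( width * depth ))        # bits per image row
    if [ $(( bits % row_bits )) -ne 0 ]; then
        echo "raw_height: size of $file is not a multiple of the row size" >&2
        return 1
    fi
    echo $(( bits / row_bits ))
}
```

With it, the conversion could be written as convert -depth 1 -size "576x$(raw_height image.raw 576)" gray:image.raw image.png.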

Batch resize images when one side is too large (linux)

I know that image resizing on the command line is something ImageMagick and similar tools can do. Unfortunately, I only have very basic bash scripting abilities, so I wonder if this is even possible:
check all directories and subdirectories for all files that are an image
check width and height of the image
if either exceeds X pixels, resize it to X while keeping the aspect ratio.
replace old file with new file (old file shall be removed/deleted)
Thank you for any input.

Implementation might be not so trivial even for advanced users. As a one-liner:
find \ # 1
~/Downloads \ # 2
-type f \ # 3
-exec file \{\} \; \ # 4
| awk -F: '{if ($2 ~/image/) print $1}' \ # 5
| while IFS= read -r file_path; do \ # 6
mogrify -resize 1024x1024\> "$file_path"; \ # 7
done # 8
Lines 1-4 are an invocation of the find command:
Specify a directory to scan.
Specify you need files only.
Per each found item run file command. Example outputs per file:
/Downloads/391A6 625.png: PNG image data, 1024 x 810, 8-bit/color RGB, interlaced
/Downloads/STRUCTURED NODES IN UML 2.0 ACTIVITES.pdf: PDF document, version 1.4
Note how file names are delimited from their info by : and info about PNG contains image word. This also will be true for other image formats.
Use awk to filter only those files which have image word in their info. This gives us image files only. Here, -F: specifies that the delimiter is :. This gives us the variable $1 to contain the original file name and $2 for the file info. We search image word in file info and print file name if it's present.
This one is a bit tricky. Lines 6-8 read the output of awk line by line and invoke the mogrify command to resize images. We do not pipe into xargs here: if file paths contain spaces or other characters which must be escaped, xargs fails with unterminated quote errors, and handling that is a pain.
Invoke the mogrify command of ImageMagick. Unlike convert, which is also an ImageMagick command, mogrify changes files in place without creating new ones. Here, 1024x1024\> tells it to fit the image within 1024x1024. The aspect ratio is preserved by default; the \> part restricts the resize to images larger than the given size, so smaller images are left alone. The final image will have a longest side of at most 1024px, and the other side will be smaller than that unless the original image is square. Pay attention to the ;, as it's needed inside loops.
Note that it's safe to run mogrify several times over the same file: if a file already fits your target dimensions, it will not be resized again. It will still change the file's modification time, though.
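If the changed modification time matters, one option is to check the dimensions first and only call mogrify on images that are actually too large. The following is a sketch under assumptions, not a tested replacement for the one-liner: it assumes bash, ImageMagick's identify and mogrify, and a file(1) that supports --mime-type; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; assume ImageMagick (identify, mogrify) and file(1).

# Succeed if either dimension is larger than the limit.
exceeds_limit() {
    local width=$1 height=$2 max=$3
    [ "$width" -gt "$max" ] || [ "$height" -gt "$max" ]
}

# Shrink every oversized image under directory $1 to fit in $2 x $2 pixels.
resize_tree() {
    local dir=$1 max=$2 path dims
    # NUL-delimited paths survive spaces and newlines in file names.
    find "$dir" -type f -print0 |
    while IFS= read -r -d '' path; do
        case $(file -b --mime-type "$path") in
            image/*) ;;
            *) continue ;;
        esac
        # "[0]" selects the first frame of multi-frame files.
        dims=$(identify -format '%w %h' "${path}[0]" 2>/dev/null) || continue
        [ -n "$dims" ] || continue
        if exceeds_limit "${dims% *}" "${dims#* }" "$max"; then
            mogrify -resize "${max}x${max}>" "$path"
        fi
    done
}
```

A call such as resize_tree ~/Downloads 1024 would then leave files that already fit completely untouched.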
Additionally, you may need not only to resize images, but to compress them as well. Please, refer to my gist to see how this can be done: https://gist.github.com/oblalex/79fa3f85f05924017d25004496493adb
If your goal is just to reduce big images in size, e.g. bigger than 300K, you may:
find /path/to/dir -type f -size +300k
and as before combine it with mogrify -strip -interlace Plane -format jpg -quality 85 -define jpeg:extent=300KB "$FILE_PATH"
In such case new jpg files will be created for non-jpg originals and originals will need to be removed. Refer to the gist to see how this can be done.

You can do that with a bash unix shell script looping over your directories. You must identify all the file formats you want such as jpg and png, etc. Then for each directory, loop over each file of the given list of formats. Then use ImageMagick to resize the files.
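cd
dirlist="path2/directory1 path2/directory2 ...."
for dir in $dirlist; do
# Subshell: each cd starts from the same place, so relative paths keep working
(
cd "$dir" || exit
# Match .jpg and .png case-insensitively without parsing ls output
for img in *.[jJ][pP][gG] *.[pP][nN][gG]; do
[ -e "$img" ] || continue # the pattern matched nothing
convert "$img" -resize "200x200>" "$img"
done
)
done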
See https://www.imagemagick.org/script/command-line-processing.php#geometry

ImageMagick convert tiffs to pdf with sequential file suffix

I have the following scenario and I'm not much of a coder (nor do I know bash well). I don't even have a base working bash script to share, so any help would be appreciated.
I have a file share that contains tiffs (thousands) of a document management system. The goal is to convert and combine from multiple file tiffs to single file pdfs (preferably PDF/A 1a format).
The directory format:
/Document Management Root # This is root directory
./2009/ # each subdirectory represents a year
./2010/
./2011/
....
./2016/
./2016/000009.001
./2016/000010.001
# files are stored flat - just thousands of files per year directory
The document management system stores tiffs with sequential number file names along with sequential file suffixes:
000009.001
000010.001
000011.002
000012.003
000013.001
Where each page of a document is represented by the suffix. The suffix restarts when a new, non-related document is created. In the example above, 000009.001 is a single page tiff. Files 000010.001, 000011.002, and 000012.003 belong to the same document (i.e. the pages are all related). File 000013.001 represents a new document.
I need to preserve the file name for the first file of a multipage document so that the filename can be cross referenced with the document management system database for metadata.
The pseudo code I've come up with is:
for each file in {tiff directory}
while file extension is "001"
convert file to pdf and place new pdf file in {pdf directory}
else
convert multiple files to pdf and place new pdf file in {pdf directory}
But this seems like it will have the side effect of converting all 001 files regardless of what the next file is.
Any help is greatly appreciated.
EDIT - Both answers below work. The second answer worked, however it was my mistake in not realizing that the data set I tested against was different than my scenario above.

So, save the following script in your login ($HOME) directory as TIFF2PDF
#!/bin/bash
ls *[0-9] | awk -F'.' '
/001$/ { if(NR>1)print cmd,outfile; outfile=$1 ".pdf"; cmd="convert " $0;next}
{ cmd=cmd " " $0}
END { print cmd,outfile}'
and make it executable (necessary just once) by going in Terminal and running:
chmod +x TIFF2PDF
Then copy a few documents from any given year into a temporary directory to try things out... then go to the directory and run:
~/TIFF2PDF
Sample Output
convert 000009.001 000009.pdf
convert 000010.001 000011.002 000012.003 000010.pdf
convert 000013.001 000013.pdf
If that looks correct, you can actually execute those commands like this:
~/TIFF2PDF | bash
or, preferably if you have GNU Parallel installed:
~/TIFF2PDF | parallel
The script says... "Generate a listing of all files whose names end in a digit and send that list to awk. In awk, use the dot as the separator between fields, so if the file is called 000011.002, then $0 will be 000011.002, $1 will be 000011 and $2 will be 002. Now, if the filename ends in 001, print the accumulated command and append the output filename. Then save the filename prefix with PDF extension as the output filename of the next PDF and start building up the next ImageMagick convert command. On subsequent lines (which don't end in 001), add the filename to the list of filenames to include in the PDF. At the end, output any accumulated commands and append the output filename."
As regards the ugly black block at the bottom of your image, it happens because there are some tiny white specks in there that prevent ImageMagick from removing the black area. I have circled them in red:
If you blur the picture a little (to diffuse the specks) and then get the size of the trim-box, you can apply that to the original, unblurred image like this:
trimbox=$(convert original.tif -blur x2 -bordercolor black -border 1 -fuzz 50% -format %# info:)
convert original.tif -crop $trimbox result.tif
I would recommend you do that first to A COPY of all your images, then run the PDF conversion afterwards. As you will want to save a TIFF file but with the extension .001, .002 and so on, you will need to tell ImageMagick to trim and force the output filetype to TIF:
original=XYZ.001
trimbox=$(convert $original -blur x2 -bordercolor black -border 1 -fuzz 50% -format %# info:)
convert $original -crop $trimbox TIF:$original
As @AlexP. mentions, there can be issues with globbing if there is a large number of files. On OSX, ARG_MAX is very high (262144) and your filenames are around 10 characters, so you may hit problems if there are more than around 26,000 files in one directory. If that is the case, simply change:
ls *[0-9] | awk ...
to
ls | grep '[0-9]$' | awk ...
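Another way around the limit, sketched under the assumption that the shell is bash: ARG_MAX only applies when an external program such as ls is executed with the expanded list, so looping over the glob inside the shell and printing with the printf builtin avoids it. The function name list_pages is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every name in directory $1 that ends in a digit,
# one per line, in glob (sorted) order.
# The glob is expanded inside the shell and printed with the printf builtin,
# so no external command ever receives the long argument list.
list_pages() {
    (
        cd "$1" || exit 1
        shopt -s nullglob   # print nothing at all when no file matches
        for name in *[0-9]; do
            printf '%s\n' "$name"
        done
    )
}
```

Its output could then be piped into the same awk program, for example list_pages . | awk ...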

The following command would convert the whole /Document Management Root tree (assuming that is its actual absolute path), properly processing all subfolders, even those with names including whitespace characters, and properly skipping all other files not matching the 000000.000 naming pattern:
find '/Document Management Root' -type f -regextype sed -regex '.*/[0-9]\{6\}.001$' -exec bash -c 'p="{}"; d="${p:0: -10}"; n=${p: -10:6}; m=10#$n; c[1]="$d$n.001"; for i in {2..999}; do k=$((m+i-1)); l=$(printf "%s%06d.%03d" "$d" $k $i); [[ -f "$l" ]] || break; c[$i]="$l"; done; echo -n "convert"; printf " %q" "${c[@]}" "$d$n.pdf"; echo' \; | bash
To do a dry run just remove the | bash in the end.
Updated to match the 00000000.000 pattern (and split to multiple lines for clarity):
find '/Document Management Root' -type f -regextype sed -regex '.*/[0-9]\{8\}.001$' -exec bash -c '
pages[1]="{}"
p1num="10#${pages[1]: -12:8}"
for i in {2..999}; do
nextpage=$(printf "%s%08d.%03d" "${pages[1]:0: -12}" $((p1num+i-1)) $i)
[[ -f "$nextpage" ]] || break
pages[i]="$nextpage"
done
echo -n "convert"
printf " %q" "${pages[@]}" "${pages[1]:0: -3}pdf"
echo
' \; | bash

Use exiv2 or imagemagick to remove EXIF data from stdin and output to stdout

How can I pipe an image into exiv2 or imagemagick, strip the EXIF tag, and pipe it out to stdout for more manipulation?
I'm hoping for something like:
exiv2 rm - - | md5sum
which would output an image supplied via stdin and calculate its md5sum.
Alternatively, is there a faster way to do this?

Using exiv2
I was not able to find a way to get exiv2 to output to stdout -- it only wants to overwrite the existing file. You could use a small bash script to make a temporary file and get the md5 hash of that.
image.sh:
#!/bin/bash
cat <&0 > tmp.jpg # Take input on stdin and dump it to temp file.
exiv2 rm tmp.jpg # Remove EXIF tags in place.
md5sum tmp.jpg # md5 hash of stripped file.
rm tmp.jpg # Remove temp file.
You would use it like this:
cat image.jpg | image.sh
Using ImageMagick
You can do this using ImageMagick instead by using the convert command:
cat image.jpg | convert -strip - - | md5sum
Caveat:
I found that stripping an image of EXIF tags using convert resulted in a smaller file-size than using exiv2. I don't know why this is and what exactly is done differently by these two commands.
From man exiv2:
rm Delete image metadata from the files.
From man convert:
-strip strip image of all profiles and comments
Using exiftool
ExifTool by Phil Harvey
You could use exiftool (I got the idea from https://stackoverflow.com/a/2654314/3565972):
cat image.jpg | exiftool -all= - -out - | md5sum
This too, for some reason, produces a slightly different image size from the other two.
Conclusion
Needless to say, all three methods (exiv2, convert, exiftool) produce outputs with different md5 hashes. Not sure why this is. But perhaps if you pick a method and stick to it, it will be consistent enough for your needs.

I tested with a NEF file. It seems that only
exiv2 rm
works well. exiftool and convert can't remove all metadata from a .nef file.
Note that the output file of exiv2 rm can no longer be displayed by most image viewers. But I only need the MD5 hash to stay the same after I update any metadata of the .NEF file, so it works perfectly for me.
