extract raster and vector images from pdf with a bash script - bash

I needed a bash script that extracts all the raster and vector images from a PDF and converts them to JPG format.
I checked many posts on the web and got most of the ideas from these:
How can I extract images from a PDF file?
Count the number of the raster images in the pdf
How to extract a vector figure from pdf?
It works, and I am sharing it because I didn't find a solution like this on the web.
But there are two small issues that I haven't been able to fix so far.
If a page contains text, pdf2svg treats the text as vector graphics and generates an extra image containing the text.
Is there any way to distinguish the text from the real vector images?
If there are multiple vector images on one page, pdf2svg generates a single SVG that contains all of them (the same as when a page contains text). Is it possible to extract them into separate images?
the bash script
#!/bin/bash
TMP_DIR=$1
SOURCE_PDF=$2
MAX_WIDTH=1920
MAX_HEIGHT=1080
echo "source: $SOURCE_PDF"
function burst
{
    local source=$1
    # splits the PDF into one-page PDF files (necessary for the vector image export)
    /usr/bin/pdftk "$source" burst
    # removes the source pdf (we do not need it any more)
    rm "$source"
    # and the txt files which were generated by pdftk
    rm -f ./*.txt
}
# finds the pages as pdf files and call check_for_images function
function process_pages {
    local tmp_dir=$1
    local pnum=1
    local f
    # the glob expands in sorted order, so pnum follows the page order
    for f in ./*.pdf
    do
        [ -f "$f" ] || continue
        echo "processing page $f"
        check_for_images "$f" "$pnum"
        let "pnum++"
    done
}
function check_for_images {
    local pdf_page=$1
    local pnum=$2
    # checks whether the page contains a raster image
    list_raster_images=$(/usr/bin/pdfimages -list "$pdf_page" | grep -E "(jpeg|png|gif)")
    is_raster_images=${#list_raster_images}
    if (( is_raster_images > 0 )); then
        # it contains raster image(s), extract them
        extract_raster_images "$pdf_page" "$pnum"
    else
        # it does not contain raster image(s), try to extract vector images
        extract_vector_images "$pdf_page" "$pnum"
    fi
    rm "$pdf_page"
}
function extract_raster_images {
    local pdf_page=$1
    local pnum=$2
    local f path img_file img_ext img_num
    pdf_file="${pdf_page%.*}"
    echo "extract all raster image(s) from this page"
    # with "./" as the image root, pdfimages names the files ./-000.jpg, ./-001.jpg, ...
    /usr/bin/pdfimages -all "$pdf_page" ./
    # we want one consistent file name convention, so this part renames them
    # by prefixing the page name (who knows, it might be useful later)
    for f in $(find . -regextype sed -regex ".*/-[0-9]\{3\}\.jpg")
    do
        path=$(dirname "$f")
        img_file=$(basename "$f")
        img_ext="${img_file##*.}"
        img_num="${img_file%.*}"
        mv "$f" "$path/$pdf_file$img_num.$img_ext"
    done
}
function extract_vector_images {
    local pdf_page=$1
    local pnum=$2
    local is_raster_image
    pdf_file="${pdf_page%.*}"
    echo "extract vector image from the page as SVG"
    /usr/bin/pdf2svg "$pdf_page" "$pdf_page.svg"
    # just to be sure it is not a raster image embedded in the SVG
    is_raster_image=$(grep -c -i "data:image" "$pdf_page.svg")
    if (( is_raster_image == 0 )); then
        # convert SVG to PNG (rsvg-convert cannot write JPG) with fixed sizes, but keep the aspect ratio
        /usr/bin/rsvg-convert -a -w "$MAX_WIDTH" -h "$MAX_HEIGHT" -f png -o "$pdf_page.png" "$pdf_page.svg"
        # convert PNG to JPG
        convert "$pdf_page.png" -background white -flatten -alpha off "$pdf_file-000.jpg"
    fi
    rm -f ./*.svg
    rm -f ./*.png
}
cd "$TMP_DIR" || exit 1
burst "$SOURCE_PDF"
process_pages "$TMP_DIR"
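The raster check in check_for_images depends on the encoding names that grep looks for. An alternative is to count the rows that `pdfimages -list` prints. This is only a minimal sketch: it assumes the listing starts with two header lines (the column names and a dashed rule) followed by one row per image, and `count_listed_images` is a hypothetical helper that is not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: reads `pdfimages -list` output on stdin and prints
# the number of image rows.
# Assumption: the first two lines of the listing are headers.
count_listed_images() {
    awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# Possible usage (requires poppler's pdfimages):
#   n=$(/usr/bin/pdfimages -list "$pdf_page" | count_listed_images)
#   if [ "$n" -gt 0 ]; then echo "page has $n raster image(s)"; fi
```

Counting rows avoids guessing which encodings appear in the listing, but it also counts masks and similar entries, so it may need filtering on the type column.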
executing it from php
$tmpName = basename($file['tmp_name']);
$tmpDir = '/path-of-tmp-dir' . $tmpName . '_extraction';
mkdir($tmpDir);
$command = 'extract_pdf_images.sh ' . escapeshellarg($tmpDir) . ' ' . escapeshellarg($file['tmp_name']);
exec($command);
requirements
apt-get install pdftk poppler-utils pdf2svg librsvg2-bin imagemagick

Related

Bash function not returning expected output

I'm trying to write a function for my Bash script to keep it DRY but for some reason the output of the code is not the same as when it's not inside a function.
What am I missing ?
Working:
#Get file name from file path
fileName="$(basename "$file")";
#Remove " ' and white space from name
fileName=${fileName//[\"\'\ ]/};
convert "$file" -resize $RESOLUTION\> "$OUTPUT_PATH"$fileName;
Not working:
function cleanUpName() {
#Get file name from file path
fileName="$(basename "$1")";
#Remove " ' and white space from name
echo ${fileName//[\"\'\ ]/};
}
convert "$file" -resize $RESOLUTION\> "$OUTPUT_PATH"$( cleanUpName $file);
As @Robin479 suggested in the comments, I was missing quotes around my file variable. The working code is as follows:
function cleanUpName() {
#Get file name from file path
fileName="$(basename "$1")";
#Remove " ' and white space from name
echo "${fileName//[\"\'\ ]/}"
}
convert "$file" -resize $RESOLUTION\> "$OUTPUT_PATH$( cleanUpName "${file}")"
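The effect of the missing quotes can be seen in isolation. A small sketch, where `count_args` is a hypothetical helper that only reports how many arguments it received and the path is made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: prints the number of arguments it was called with.
count_args() {
    echo "$#"
}

# Made-up path containing spaces.
file="/tmp/my holiday photo.jpg"

count_args $file      # unquoted: the shell splits the value on spaces
count_args "$file"    # quoted: the value is passed as one argument
```

With the default IFS, the unquoted call passes three words, so a function reading `$1` only sees `/tmp/my`; the quoted call passes the full path as a single argument.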

How can I pass a variable as a string into a function in bash?

I have a small bash script that does simple file modifications and I want to rewrite the code to be more readable. My goal is to pass Commands as strings into a function that loops the command over a Directory.
I've tried to use different methods to escape the "$" or different """ combinations but none really work.
#!/bin/bash
process="/Users/Gernot/Tools/.Process"
output="/Users/Gernot/Tools/2 Output"
input="/Users/Gernot/Tools/1 Input/"
function run {
for file in "$input$1"/*
do
echo "running procedure $1" #echoes which procedure is running
$2 #does the command for every file in the directory
done
}
run "PDF Komprimieren" "magick convert \$file -density 110 -compress jpeg -quality 100 \$file"
This is the error I get:
running procedure PDF Komprimieren
convert: unable to open image '$file': No such file or directory # error/blob.c/OpenBlob/3497.
convert: no decode delegate for this image format `' # error/constitute.c/ReadImage/556.
convert: no images defined `$file' # error/convert.c/ConvertImageCommand/3273.
Try using a function like this:
pdf_komprimieren() {
find "PDF Komprimieren" -maxdepth 2 -type f -print0 |
xargs -0 -Ifile magick convert "file" -density 110 -compress jpeg -quality 100 "file"
}
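Another way to keep the loop generic is to pass the name of a function instead of a command string, so that `$file` never has to be escaped. A minimal sketch under that assumption; `run_on_dir` and `compress_pdf` are hypothetical names, and the magick options are copied from the question:

```shell
#!/bin/bash
# Hypothetical helper: run_on_dir DIR COMMAND [ARGS...]
# Calls COMMAND once per regular file in DIR, appending the file path
# as the last argument.
run_on_dir() {
    local dir="$1"
    shift
    local file
    for file in "$dir"/*; do
        [ -f "$file" ] || continue
        "$@" "$file"
    done
}

# Hypothetical per-file step, using the options from the question.
compress_pdf() {
    magick convert "$1" -density 110 -compress jpeg -quality 100 "$1"
}

# Possible usage:
#   run_on_dir "/Users/Gernot/Tools/1 Input/PDF Komprimieren" compress_pdf
```

Because the command is expanded as `"$@"`, each word stays a separate argument and file names with spaces are passed through intact.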

How to write every Nth file to new folder

I have this code which scans folders and moves all files in each folder to a new one.
How do I make it so only every Nth file is moved?
#!/bin/bash
# Save this file in the directory containing the folders (bb in this case)
# Then to run it, type:
# ./rencp.sh
# The first output frame number
let "frame=1"
# this is where files will go. A new directory will be created if it doesn't exist
outFolder="collected"
# print info every so many files.
feedbackFreq=250
# prefix for new files
namePrefix="ben_timelapse"
#new extension (uppercase is so ugly)
ext="jpg"
# this will make sure we only get files from camera directories
srcPattern="ND850"
mkdir -p $outFolder
for f in *${srcPattern}/*
do
mv $f `printf "$outFolder/$namePrefix.%05d.$ext" $frame`
if ! ((frame % $feedbackFreq)); then
echo "moved and renamed $frame files to $outFolder"
fi
let "frame++"
done
Pretty sure I need to edit the line for f in *${srcPattern}/* but not sure of the correct syntax
If files in the ND850 folders are sequential when listed (i.e. padded frame numbers), and the folders themselves are in order, then the following code should work.
#!/bin/bash
# Maintain a counter, and the output frame number
let "frame=1"
let "outframe=1"
outFolder="collected"
# frequency
gap=5
namePrefix="ben_timelapse"
#new extension (uppercase is so ugly)
ext="jpg"
srcPattern="ND850"
echo "Copying and renaming 1 in every $gap files"
mkdir -p "$outFolder"
for f in *${srcPattern}/*
do
if ! ((frame % $gap)); then
outfile=`printf "$outFolder/$namePrefix.%05d.$ext" $outframe`
cp $f "$outfile"
echo "copied $f to $outfile"
let "outframe++"
fi
let "frame++"
done
Try this instead of your mv command after do:
if ! ((frame % 5)); then
    a=$((frame / 5))
    mv "$f" "$(printf "$outFolder/$namePrefix.%05d.$ext" "$a")"
fi
It will move frame=5,10, and so on, to $outFolder/$namePrefix.00001.$ext,$outFolder/$namePrefix.00002.$ext, and so on
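The selection logic in both answers is a counter tested with the modulo operator. Separated from the copying and moving, it can be sketched like this; `every_nth` is a hypothetical helper that only prints the selected names:

```shell
#!/bin/bash
# Hypothetical helper: every_nth N FILE...
# Prints every Nth argument (the Nth, 2Nth, 3Nth, ...), one per line.
every_nth() {
    local n="$1"
    shift
    local i=0
    local f
    for f in "$@"; do
        i=$((i + 1))
        if [ $((i % n)) -eq 0 ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Possible usage with the folder pattern from the question:
#   every_nth 5 *ND850/*
```

As in the answers, the result depends on the glob listing the files in frame order.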

A bash script to split a data file into many sub-files as per an index file using dd

I have a large data file that contains many files joined together.
It has a separate index file that holds the file name and the start and end byte of each file within the data file.
I need help creating a bash script to split the large file into its thousands of sub-files.
Data File : fileafilebfilec etc
Index File:
filename.png<0>3049
folder\filename2.png<3049>6136.
I guess this needs to loop through each line of the index file and then use dd to extract the relevant bytes into a file. A fiddly part might be that the folder separator is Windows style (backslash) rather than Linux style.
Any help much appreciated.
while read p; do
q=${p#*<}
startbyte=${q%>*}
endbyte=${q#*>}
filename=${p%<*}
count=$(($endbyte - $startbyte))
toprint="processing $filename startbyte: $startbyte endbyte: $endbyte count: $count"
echo "$toprint"
done <indexfile
Worked it out :-) FYI:
while IFS= read -r p; do
    # sort out variables (read -r keeps the backslashes in the index lines)
    q=${p#*<}
    startbyte=${q%>*}
    endbyte=${q#*>}
    filename=${p%<*}
    # the index uses Windows-style separators, so convert them to /
    filename=${filename//\\//}
    count=$((endbyte - startbyte))
    # let it know we're working
    echo "processing $filename startbyte: $startbyte endbyte: $endbyte count: $count"
    if [[ $filename == *"/"* ]]; then
        echo "have found /"
        directory=${filename%/*}
        # if the directory doesn't exist, create it
        if [ ! -d ~/etg/"$directory" ]; then
            echo "directory not found - creating one"
            mkdir -p ~/etg/"$directory"
        fi
    fi
    dd skip="$startbyte" count="$count" if=~/etg/largefile of=~/etg/"$filename" bs=1
done <indexfile
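The same idea can be wrapped in a function so that the paths are not hard-coded. This is a sketch only: `split_by_index` is a hypothetical name, and it assumes each index line has the form `name<start>end` with the end byte exclusive, as the sample index suggests.

```shell
#!/bin/bash
# Hypothetical helper: split_by_index DATA_FILE INDEX_FILE OUT_DIR
# Assumption: each index line is "name<start>end", end byte exclusive.
split_by_index() {
    local data="$1"
    local index="$2"
    local outdir="$3"
    local line name rest start end
    while IFS= read -r line; do
        [ -n "$line" ] || continue
        name=${line%%<*}
        rest=${line#*<}
        start=${rest%%>*}
        end=${rest#*>}
        # convert Windows-style separators to /
        name=$(printf '%s' "$name" | tr '\\' '/')
        mkdir -p "$(dirname "$outdir/$name")"
        # bs=1 makes skip and count byte offsets (simple, but slow on big files)
        dd if="$data" of="$outdir/$name" bs=1 skip="$start" \
            count="$((end - start))" 2>/dev/null
    done < "$index"
}

# Possible usage:
#   split_by_index ~/etg/largefile indexfile ~/etg
```

Reading with `read -r` matters here, because a plain `read` would silently drop the backslashes in the Windows-style paths.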

Combine images with ImageMagick using Applescript in Automator

I have a PDF file and I'm using Automator to generate JPG files from each slide. Then I want to loop through these images and create side-by-side photos from them: pages 1 and 2 become one image, pages 3 and 4 another, and so on. So far I have an AppleScript action whose input is all of the image files:
on run {input, parameters}
set selectedFiles to {}
repeat with i in input
copy (POSIX path of i) to end of selectedFiles
end repeat
return selectedFiles
end run
What I want to run is this ImageMagick command, which merges two images next to each other:
do shell script "convert +append 01.jpg 02.jpg '01&2.jpg'"
How can I add this command to my AppleScript above?
Or would it be easier to do with a shell script?
I'm in a bit of a rush, so this is not my best piece of code, but you can do it in bash like this:
#!/bin/bash
# pdfsheets
#
# Pass the name of a PDF as parameter and get it as a bunch of double-page spreads called "sheet-0.jpg" ... "sheet-n.jpg"
#
# Pick up parameter
pdf=$1
echo "Processing document: $pdf"
# Split document into individual pages each as JPEG
convert "$pdf" page-$$-%03d.jpg
# Get names of pages into array and see how many we got
pages=( page-$$-*.jpg )
npages=${#pages[@]}
echo DEBUG:npages:$npages
# Check if odd number of pages - synthesize empty white one at end if odd
if [ $((npages%2)) -ne 0 ]; then
lastpage=${pages[@]:(-1)}
echo DEBUG:lastpage:$lastpage
newlast=$(printf "page-$$-%03d.jpg" $npages)
convert "$lastpage" -threshold -1 $newlast
pages=( page-$$-*.jpg )
npages=${#pages[@]}
fi
s=0
for ((i=0;i<npages;i+=2)) do
a=${pages[i]}
b=${pages[((i+1))]}
out=$(printf "sheet-%d.jpg" $s)
echo Converting $a and $b to $out
convert $a $b +append $out
((s++))
done
Run it like this and get output as follows:
./pdfsheets document.pdf
Processing document: document.pdf
DEBUG:npages:5
DEBUG:lastpage:page-49169-004.jpg
Converting page-49169-000.jpg and page-49169-001.jpg to sheet-0.jpg
Converting page-49169-002.jpg and page-49169-003.jpg to sheet-1.jpg
Converting page-49169-004.jpg and page-49169-005.jpg to sheet-2.jpg
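The pairing step can also be written without array indexing by consuming the page names two at a time. A minimal sketch, where `pair_pages` is a hypothetical helper that only prints the plan (output name followed by its input pages) instead of calling convert:

```shell
#!/bin/bash
# Hypothetical helper: pair_pages PAGE...
# Prints one line per sheet: the sheet name followed by its one or two pages.
pair_pages() {
    local s=0
    while [ "$#" -gt 0 ]; do
        if [ "$#" -ge 2 ]; then
            printf 'sheet-%d.jpg %s %s\n' "$s" "$1" "$2"
            shift 2
        else
            # odd page count: the last sheet has a single page
            printf 'sheet-%d.jpg %s\n' "$s" "$1"
            shift
        fi
        s=$((s + 1))
    done
}

# Possible usage:
#   pair_pages page-*.jpg
```

Unlike the script above, this leaves a trailing odd page on its own rather than synthesizing a white partner, so the last sheet would be half as wide unless a blank page is added first.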
