Unix: Combine PDF files and images into a PDF file?

A friend of mine is asking this question: he is using a Mac and cannot get pdflatex working (he has no dev CD; related here). Anyway, my first idea:
$ pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf [only pdfs]
$ convert 1.png 2.png myfile.pdf [only images]
Now, without LaTeX or the iPad's Notes Plus, I don't know how to combine images and PDF files. So how can I combine PDF files and images in Unix?

You could run a loop, identifying PDFs and images, and converting the images to PDF with ImageMagick. When you're done, you assemble it all with pdftk.
This is a Bash-only script.
#!/bin/bash
# Convert arguments into a list
N=0
for file in "$@"; do
    files[$N]=$file
    N=$((N + 1))
done
# Last element of the list is our destination filename
N=$((N - 1))
LAST=${files[$N]}
unset files[$N]
N=$((N - 1))
# Check all files in the input array, converting image types
T=0
for i in $(seq 0 $N); do
    file=${files[$i]}
    case ${file##*.} in
        jpg|png|gif|tif)
            temp="tmpfile.$T.pdf"
            convert "$file" "$temp"
            tmp[$T]=$temp
            uses[$i]=$temp
            T=$((T + 1))
            # Or also: tmp=("${tmp[@]}" "$temp")
            ;;
        pdf)
            uses[$i]=$file
            ;;
    esac
done
# Now assemble the PDF files
pdftk "${uses[@]}" cat output "$LAST"
# Remove all temporary files. Disabled because you never know :-)
echo "I would remove ${tmp[*]}"
# rm "${tmp[@]}"
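Assuming you save the script as, say, combine.sh (the file name is just an illustration, not part of the answer), the inputs go first and the destination file last:
$ ./combine.sh cover.png chapter1.pdf figure2.jpg combined.pdf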

I am gathering some info here.
Unix Commandline
More about Pdftk here.
Merging png images into one pdf file
Combine all files in a folder as pdf
Mac
Because the moderator on Apple SE removed the useful thread "Merge PDF files and images to single PDF file in Mac?" (here), I am collecting the Mac tips here -- sorry, but the moderator is very intolerant of collecting newbie things.
https://apple.stackexchange.com/questions/16226/what-software-is-available-preferably-free-to-create-and-edit-pdf-files-on-mac
https://apple.stackexchange.com/questions/812/how-can-i-combine-two-pdfs-in-preview
https://apple.stackexchange.com/questions/11163/how-do-i-combine-two-or-more-images-to-get-a-single-pdf-file
https://apple.stackexchange.com/questions/69659/ipad-pdf-software-to-edit-merge-annotate-etc-well-pdf-documents-like-in-deskto

Related

Adding extension to file using file command in linux

I have a file that does not have an extension and would like to add an extension to it programmatically. I know the file command gives information about the type of a file. How can I use this to add the extension to a file? The files I'm downloading can be assumed to be image files (png, jpg, etc.).
My desired outcome would be:
Input: filename
Output: filename.ext
All inside a bash script
Something like this should get you started:
#!/bin/bash
for f in "$@"; do
    if [[ $f == *'.'* ]]; then continue; fi # Naive check to make sure we don't add duplicate extensions
    ext=''
    case $(file -b "$f") in
        *ASCII*) ext='.txt' ;;
        *JPEG*)  ext='.jpg' ;;
        *PDF*)   ext='.pdf' ;;
        # etc...
        *) continue ;;
    esac
    mv "${f}" "${f}${ext}"
done
You'll have to check the output of file for each potential file type to find an appropriate case label.
You can try to find or create a map from file type to file extension, but there is no universal way. Think about JPEG images: they can have either a .jpg or a .jpeg extension, and both mean the same thing. The same goes for MP4 video containers...
Also, on Linux the file extension doesn't even matter to most programs, so you could simply not care about it. If you still want to do it for certain types of files, you can check this answer: https://stackoverflow.com/a/6115923/9759362
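If you do want a broader mapping, here is a minimal sketch of my own (not taken from the linked answer) that keys off file --mime-type rather than the human-readable description; the MIME-to-extension table is only an example you would extend, and the associative array needs bash 4+:
#!/bin/bash
# Map MIME types reported by `file -b --mime-type` to extensions (extend as needed)
declare -A ext_for=( [image/png]=.png [image/jpeg]=.jpg [application/pdf]=.pdf )
for f in "$@"; do
    [[ $f == *.* ]] && continue          # already has an extension
    mime=$(file -b --mime-type "$f")
    ext=${ext_for[$mime]}
    [[ -n $ext ]] && mv "$f" "$f$ext"
done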

Linux tools: How to apply a filter on a PDF document?

I am a researcher and I have to read research papers. Unfortunately, the characters are not dark enough, so the papers are hard to read when printed on paper. Note that the printer's cartridge is fine, but the characters are not printed dark enough (the text is already black: take a look at a sample).
This is how the characters look in Photoshop:
Note that the background is transparent when you import a PDF document into Photoshop.
I use this awful solution:
First, I import the PDF document into Photoshop. The pages are imported as individual images with a transparent background.
Then, for each page, I apply one of these two methods:
Method 1: Copy the layer over itself multiple times, so that the image gets darker
Method 2: Apply a Min filter on the image
This is how it looks after conversion (left: min filter; right: layer duplication):
This solves my problem for printing a single page, and I can read the printed contents easily. However, it is impractical to convert every page of every PDF paper using Photoshop. Is there any wiser solution/tool/application?
Here is what I need:
1. How to convert a PDF to high-quality images (either on Linux or Windows, with any tool).
2. How to apply a min filter (or any better filter) to the image files automatically (e.g. with a script or whatever).
Thanks!
It can be done using four tools:
pdftoppm: Convert the PDF file to separate PNG images
octave (MATLAB's open-source alternative): Generate an Octave script that applies the min filter to the images; after running the script, the filtered pages are written as new PNG files (with a p prefix)
pdflatex: Create a single .tex file that imports all the filtered images, then compile the .tex file to obtain a single PDF
bash!: A bash script that automates the process
Here is a fully automated solution as a single bash script:
#!/bin/bash
# Convert the PDF given as $1 into separate 300 dpi PNG images
pdftoppm -rx 300 -ry 300 -png "$1" img
# Create an Octave script that applies the min filter to all image files
echo "#!/usr/bin/octave -qf" > script.m
echo "pkg load image" >> script.m
for i in img*.png
do
    echo "i = imread('$i');" >> script.m
    echo "i(:,:,1) = ordfilt2(i(:,:,1),1,ones(3,3));" >> script.m
    echo "i(:,:,2) = ordfilt2(i(:,:,2),1,ones(3,3));" >> script.m
    echo "i(:,:,3) = ordfilt2(i(:,:,3),1,ones(3,3));" >> script.m
    echo "imwrite(i,'p$i')" >> script.m
done
# Run the Octave script
chmod 755 script.m
./script.m
# Convert the filtered PNG images to a single PDF:
# create a LaTeX file that includes all the image files
echo "\documentclass{article}" > f.tex
echo "\usepackage[active,tightpage]{preview}" >> f.tex
echo "\usepackage{graphicx}" >> f.tex
echo "\PreviewMacro[{*[][]{}}]{\includegraphics}" >> f.tex
echo "\begin{document}" >> f.tex
echo -n "%" >> f.tex
for i in pimg*.png
do
    echo "\newpage" >> f.tex
    echo "\includegraphics{$i}" >> f.tex
done
echo "\end{document}" >> f.tex
# Compile the LaTeX document
pdflatex -synctex=1 -interaction=nonstopmode f
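Assuming you save this as, e.g., darken.sh (the name is mine, not part of the answer), you would run it on a paper and pick up pdflatex's output, which ends up in f.pdf:
$ chmod +x darken.sh
$ ./darken.sh paper.pdf
$ mv f.pdf paper-dark.pdf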

How to automate file transformation with simultaneous execution?

I am working on transforming a lot of image files (PNG) into text files. I have the basic code to do this one by one, but it is really time consuming. My process involves converting the image files to black and white and then using tesseract to turn them into text files. This process works well, but it would take days to accomplish my task if done file by file.
Here is my code:
for f in $1
do
    echo "Processing $f file..."
    convert "$f" -resample 200 -colorspace Gray "${f%.*}BW.png"
    echo "OCR'ing $f"
    tesseract "${f%.*}BW.png" "${f%.*}" -l tla -psm 6
    echo "Removing black and white for $f"
    rm "${f%.*}BW.png"
done
echo "Done!"
Is there a way to perform this process on each file at the same time? That is, how would I be able to run this process simultaneously instead of one by one? My goal is to significantly reduce the amount of time it takes to transform these images into text files.
Thanks in advance.
You could make the body of your for loop a function, then call the function once per file and send each call to the background, so the next one can start while the previous ones are still running.
function my_process {
    echo "Processing $1 file..."
    convert "$1" -resample 200 -colorspace Gray "${1%.*}BW.png"
    echo "OCR'ing $1"
    tesseract "${1%.*}BW.png" "${1%.*}" -l tla -psm 6
    echo "Removing black and white for $1"
    rm "${1%.*}BW.png"
}
for file in "${files[@]}"
do
    # & at the end sends it to the background.
    my_process "$file" &
done
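One caveat with this approach (my note, not part of the original answer): it backgrounds every file at once, which can overload the machine on large batches, and the script may exit before the jobs finish. At a minimum, add a wait after the loop:
wait   # block until all background OCR jobs have finished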
I want to thank contributors @Songy and @shellter.
To answer my question... I ended up using GNU Parallel to run these processes 5 at a time. Here is the code that I used:
parallel -j 5 convert {} "-resample 200 -colorspace Gray" {.}BW.png ::: *.png ; parallel -j 5 tesseract {} {} -l tla -psm 6 ::: *BW.png ; rm *BW.png
I am now in the process of splitting my dataset in order to run this command simultaneously with different subgroups of my (very large) pool of images.
Cheers

Execute Script to Run on Multiple Files

I have a script that I need to run on a large number of files.
This is the script and how it is run:
./tag-lbk.sh test.txt > output.txt
It takes a file as input and creates an output file. I need to run this on several input files, and I want a different output file for each input file.
How would I go about doing this? Can I write a script to do it? (I don't have much experience writing bash scripts.)
[edits]:
@fedorqui asked: Where are the names of the input files and output files stored?
There are several thousand files, each with a unique name. I was thinking maybe there is a way to recursively iterate through all the files (they are all .txt files). The output files should have names that are generated recursively, but in a random fashion.
Simple solution: Use two folders.
for input in /path/to/folder/*.txt ; do
    name=$(basename "$input")
    ./tag-lbk.sh "$input" > "/path/to/output-folder/$name"
done
or, if you want everything in the same folder:
for input in *.txt ; do
    if [[ "$input" = *-tagged.txt ]]; then
        continue # skip output
    fi
    name=$(basename "$input" .txt)-tagged.txt
    ./tag-lbk.sh "$input" > "$name"
done
Try this with a small set of inputs, somewhere where it doesn't matter if files get deleted, corrupted or overwritten.
The script below will find the files with the extension .txt and redirect the output of the tag-lbk script to a randomly generated log file such as log.123.
#!/bin/bash
declare -a ar
# Find the files and store them in an array.
# This way you don't iterate over the output files
# generated by this script.
ar=($(find . -iname "*.txt"))
# Now iterate over the files and run your script
for i in "${ar[@]}"
do
    # Create a random file in the format log.123, log.345
    tmp_f=$(mktemp log.XXX)
    # Redirect your output to the log file
    ./tag-lbk.sh "$i" > "$tmp_f"
done
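Note that ar=($(find ...)) word-splits the output of find, so it breaks on file names containing spaces. A more robust variant of the same idea (my own sketch) reads the names NUL-delimited instead of collecting them in an array:
#!/bin/bash
# Iterate over the .txt files safely, even when names contain spaces
find . -iname "*.txt" -print0 | while IFS= read -r -d '' i
do
    tmp_f=$(mktemp log.XXX)
    ./tag-lbk.sh "$i" > "$tmp_f"
done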

bash while loop and ffmpeg

Sorry for this question; I imagine the answer is pretty straightforward. However, I have searched and haven't found anything that answers it.
I have written the following:
while read p # reads in each line of the text
do # file- folder_list.txt each iteration.
echo "$p" # print out current line
if [ -f $p/frame_number1.jpg ] # checks if the image exists in the specific folder
then
echo "$p" #prints specific folder name
sleep 1 #pause for 1 second
ffmpeg -f image2 -r 10 -i $p/frame_number%01d.jpg -r 30 $p/out.mp4 #create video
fi # end if statement
done <folder_list.txt #end of while loop
The script is supposed to read a text file that contains the folder tree structure, then check whether each folder contains the specified JPEG; if it does, the script should create a video from the images contained in that folder.
However, what appears to be happening is that the script skips whole folders that definitely contain images. I was wondering whether the while loop continues to iterate while the video is being created, or if something else is happening?
Any help would be greatly appreciated.
Many thanks in advance
Laurence
It could be because some folders contain spaces, or other weird characters.
Try this (won't work for ALL but for MOST cases):
while read p                               # reads in each line of the text file folder_list.txt
do                                         # each iteration
    echo "$p"                              # print out current line
    if [ -f "${p}/frame_number1.jpg" ]     # checks whether the image exists in the specific folder
    then
        echo "$p"                          # prints out specific folder name
        sleep 1                            # pause for 1 second
        ffmpeg -f image2 -r 10 -i "${p}/frame_number%01d.jpg" -r 30 "${p}/out.mp4" # create video
    fi                                     # end if statement
done <folder_list.txt                      # end of while loop
I just added quotes around arguments involving "$p", so that if it contains spaces (or other special characters) it is not split into several arguments, which would break the command.
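If the quoting alone doesn't stop folders from being skipped, another common cause with while read loops is that ffmpeg itself reads from standard input and swallows the remaining lines of folder_list.txt. You can rule that out by telling ffmpeg to leave stdin alone:
ffmpeg -nostdin -f image2 -r 10 -i "${p}/frame_number%01d.jpg" -r 30 "${p}/out.mp4" # create video
(or redirect its input: ffmpeg ... < /dev/null)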
If this doesn't work, tell us exactly a few of the directories that are OK and the ones that are NOT OK:
ls -ald /some/ok/dir /some/NOTOK/dir # will show which directories belong to whom
id # will tell us which user you are running as, so we can compare with the above and find out why you can't access some directories

Resources