re-ordering the files in shell script - bash

On my system the files are named like this:
image0.jpg
image1.jpg
image2.jpg
image3.jpg
.
.
.
.
.
image10.jpg
image11.jpg
I can rename the files with the script below:
for i in *.jpg; do
new=$(printf "xome%04d.jpg" "$a") #04 pad to length of 4
mv -- "$i" "$new"
let a=a+1
done
but what happens is that it picks up the files in this order: image0.jpg, image10.jpg, image1.jpg, image11.jpg, image2.jpg, ...
How can I process them in numeric sequence in a shell script?

Try this:
for i in {0..11}; do
printf -v newfile "xome%04d.jpg" "$i"
[ -f "image${i}.jpg" ] && mv -- "image${i}.jpg" "$newfile"
done
The -v option stores the printf result in the variable newfile.
[ -f "image${i}.jpg" ] tests whether the file exists before trying to rename it.
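When the highest number is not known in advance, the glob can be sorted numerically instead of hard-coding a range. The following is a minimal sketch, not part of the original answer: the function name renumber_images is hypothetical, and it assumes a sort that supports -V (version sort, available in GNU coreutils) and file names without newlines.

```shell
# Sketch: rename image<N>.jpg to xome<NNNN>.jpg in numeric order of N.
# renumber_images is a hypothetical helper; it needs `sort -V` (GNU coreutils).
renumber_images() (
    cd "$1" || exit 1
    n=0
    # List the matching files, skipping the literal pattern if nothing matches.
    for f in image*.jpg; do
        [ -e "$f" ] && printf '%s\n' "$f"
    done | sort -V | while IFS= read -r f; do
        new=$(printf 'xome%04d.jpg' "$n")   # %04d pads the counter to 4 digits
        mv -- "$f" "$new"
        n=$((n + 1))
    done
)
```

It could be called as renumber_images . from the directory that holds the images. The body runs in a subshell, so the cd and the counter do not leak into the calling shell.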

Related

Sequentially numbering of files in different folders while keeping the name after the number

I have a lot of ogg or wave files in different folders that I want to sequentially number while keeping everything that stands behind the prefixed number. The input may look like this
Folder1/01 Insbruck.ogg
02 From Milan to Rome.ogg
03 From Rome to Naples.ogg
Folder2/01 From Naples to Palermo.ogg
02 From Palermo to Syracrus.ogg
03 From Syracrus to Tropea.ogg
The output should be:
Folder1/01 Insbruck.ogg
02 From Milan to Rome.ogg
03 From Rome to Naples.ogg
Folder2/04 From Naples to Palermo.ogg
05 From Palermo to Syracrus.ogg
06 From Syracrus to Tropea.ogg
The sequential numbering across folders can be done with this BASH script that I found here:
find . | (i=0; while read f; do
let i+=1; mv "$f" "${f%/*}/$(printf %04d "$i").${f##*.}";
done)
But this script removes the title that I would like to keep.
TL;DR
Like this, using Perl rename:
rename -n 's#/\d+#sprintf "/%0.2d", ++$::c#e' Folder*/*
Drop -n switch if the output looks good.
With -n, you only see the files that will really be renamed, so only 3 files from Folder2.
Going further
The variable $::c (or $main::c, a package variable) is a little hack to avoid the use of more complex expressions:
rename -n 's#/\d+#sprintf "/%0.2d", ++our $c#e' Folder*/*
or
rename -n '{ no strict; s#/\d+#sprintf "/%0.2d", ++$c#e; }' Folder*/*
or
rename -n '
do {
use 5.012;
state $c = 0;
s#/\d+#sprintf "/%0.2d", ++$c#e
}
' Folder*/*
Thanks go|dfish & Grinnz on freenode
A bash script for this job would be:
#!/bin/bash
argc=$#
width=${#argc}
n=0
for src; do
base=$(basename "$src")
dir=$(dirname "$src")
if ! [[ $base =~ ^[0-9]+\ .*\.(ogg|wav)$ ]]; then
echo "$src: Unexpected file name. Skipping..." >&2
continue
fi
printf -v dest "$dir/%0${width}d ${base#* }" $((++n))
echo "moving '$src' to '$dest'"
# mv -n "$src" "$dest"
done
and could be run as
./renum Folder*/*
assuming the script is saved as renum. It will just print out source and destination file names. To do actual moving, you should drop the # at the beginning of the line # mv -n "$src" "$dest" after making sure it will work as expected. Note that the mv command will not overwrite an existing file due to the -n option. This may or may not be desirable. The script will print out a warning message and skip unexpected file names, that is, the file names not fitting the pattern specified in the question.
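The two pieces that do the real work in this script are the padding width taken from the number of arguments and the ${base#* } expansion that keeps the title. As a minimal sketch, they can be isolated in a small helper; the name renumbered_name and its argument order are hypothetical and not part of the answer's script.

```shell
# Sketch: build the renumbered path for one file.
# renumbered_name is a hypothetical helper, shown only to isolate the logic.
renumbered_name() (
    src=$1      # e.g. "Folder2/01 From Naples to Palermo.ogg"
    n=$2        # new sequence number
    width=$3    # zero-padding width, e.g. ${#total} for a file count in $total
    dir=$(dirname "$src")
    base=$(basename "$src")
    title=${base#* }    # drop the old number and the first space, keep the title
    printf "%s/%0${width}d %s\n" "$dir" "$n" "$title"
)
```

For example, renumbered_name "Folder2/01 From Naples to Palermo.ogg" 4 2 prints Folder2/04 From Naples to Palermo.ogg.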
Not as robust as the accepted answer, but this is an improved version of your script, in case rename is not available.
#!/usr/bin/env bash
[[ -n $1 ]] || {
printf >&2 'Needs a directory as an argument!\n'
exit 1
}
n=1
directory=("$@")
while IFS= read -r files; do
if [[ $files =~ ^(.+)?\/([[:digit:]]+[^[:blank:]]+)(.+)$ ]]; then
printf -v int '%02d' "$((n++))"
[[ -e "${BASH_REMATCH[1]}/$int${BASH_REMATCH[3]}" ]] && {
printf '%s is already in sequential order, skipping!\n' "$files"
continue
}
echo mv -v "$files" "${BASH_REMATCH[1]}/$int${BASH_REMATCH[3]}"
fi
done < <(find "${directory[@]}" -type f | sort)
Now run the script with the directory in question as the argument.
./myscript Folder*/
or
./myscript Folder1/
or
./myscript Folder2/
or with a dot, which stands for the current directory:
./myscript .
and so on...
Remove the echo if you're satisfied with the output.

Add character to file name if duplicate when moving with bash

I currently use a bash script and pdfgrep to rename files to a certain structure. However, to avoid overwriting a file when the new name is a duplicate, I want to add a number at the end of the name. Keep in mind that there may be 3 or 4 duplicate names. What's the best way to do this?
#!/bin/bash
if [ $# -ne 1 ]; then
echo Usage: Renamer file
exit 1
fi
f="$1"
id1=$(pdfgrep -m 1 -i "MR# : " "$f" | grep -oE "[M][0-9][0-9]+") || continue
id2=$(pdfgrep -m 1 -i "Visit#" "$f" | grep -oE "[V][0-9][0-9]+") || continue
{ read today; read dob; read dop; } < <(pdfgrep -i " " "$f" | grep -oE "[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]")
dobsi=$(echo $dob | sed -e 's/\//-/g')
dopsi=$(echo $dop | sed -e 's/\//-/g')
mv -- "$f" "${id1}_${id2}_$(printf "$dobsi")_$(printf "$dopsi")_1.pdf"
Use a loop that checks if the destination filename exists, and increments a counter if it does. Replace the mv line with this:
prefix="${id1}_${id2}_${dobsi}_${dopsi}"
counter=0
while true
do
if [ "$counter" -ne 0 ]
then target="${prefix}_${counter}.pdf"
else target="${prefix}.pdf"
fi
if [ ! -e "$target" ]
then
mv -- "$f" "$target"
break
fi
((counter++))
done
Note that this suffers from a TOCTTOU (time-of-check to time-of-use) problem if the duplicate file is created between the ! -e "$target" test and the mv. I thought it would be possible to replace the existence check with mv -n; but while this won't overwrite the file, it still treats the mv as successful, so you can't test the result to see whether you need to increment the counter.
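One way to narrow that race window is to let the file system do the check: ln without -f fails when the target already exists, so the test and the creation happen in a single step. The sketch below is not from the original answer. The function name move_no_clobber and the retry cap of 1000 are hypothetical, and it assumes source and target are on the same file system, since hard links do not cross file systems.

```shell
# Sketch: move a file to <prefix>.pdf, or <prefix>_1.pdf, <prefix>_2.pdf, ...
# without overwriting. move_no_clobber is a hypothetical helper.
# Assumes source and target live on the same file system (hard links).
move_no_clobber() (
    src=$1
    prefix=$2
    [ -f "$src" ] || { echo "no such file: $src" >&2; exit 1; }
    counter=0
    target="$prefix.pdf"
    # ln refuses to replace an existing name, so this is check-and-create in one step.
    until ln -- "$src" "$target" 2>/dev/null; do
        counter=$((counter + 1))
        [ "$counter" -le 1000 ] || { echo "giving up on $src" >&2; exit 1; }
        target="${prefix}_${counter}.pdf"
    done
    rm -- "$src"              # the new name exists now, so drop the old one
    printf '%s\n' "$target"   # report the name that was finally used
)
```

In the script above it could replace the loop with something like move_no_clobber "$f" "$prefix".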

Automatically rename fasta files with the ID of the first sequence in each file

I have multiple fasta files in the same directory, each containing a single sequence. I want to rename each fasta file after the header of the sequence it contains. When I run my code, I get "Substitution pattern not terminated at (user-supplied code)".
my code:
#!/bin/bash
for i in /home/maryem/files/;
do
if [ ! -f $i ]; then
echo "skipping $i";
else
newname=`head -1 $i | sed 's/^\s*\([a-zA-Z0-9]\+\).*$/\1/'`;
[ -n "$newname" ] ;
mv -i $i $newname.fasta || echo "error at: $i";
fi;
done | rename s/ // *.fasta
fasta file:
>NC_013361.1 Escherichia coli O26:H11 str. 11368 DNA, complete genome
AGCTTTTCATTCTGACTGCAATGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTCTGACAGCAGCTTCTGAACTG
GTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGAC
AGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTATCACCACCATCACCATTACCACAGGT
Is there another way to rename each file with the ID in its header?
Given that the ID is the first "word" of the file, you can run the following in the directory containing the fasta files.
for f in *.fasta; do d="$(head -n 1 "$f" | awk '{sub(/^>/, "", $1); print $1}').fasta"; if [ ! -f "$d" ]; then mv -- "$f" "$d"; else echo "File '$d' already exists! Skipped '$f'"; fi; done
Credit: https://unix.stackexchange.com/a/13161
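The same idea can be spelled out as two small functions, which makes it easier to see how the ID is taken from the header line. This is only a sketch: the names fasta_id and rename_fasta_by_id are hypothetical, and it assumes the ID is everything between the leading > and the first whitespace and contains no slash.

```shell
# Sketch: print the ID from the first line of a fasta file.
# fasta_id and rename_fasta_by_id are hypothetical helper names.
fasta_id() {
    # Drop the leading '>' and everything from the first whitespace onwards.
    head -n 1 "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# Rename every *.fasta file in a directory to <ID>.fasta, never overwriting.
rename_fasta_by_id() (
    cd "$1" || exit 1
    for f in *.fasta; do
        [ -f "$f" ] || continue
        id=$(fasta_id "$f")
        [ -n "$id" ] || { echo "no ID found in '$f', skipped" >&2; continue; }
        dest="$id.fasta"
        [ "$f" = "$dest" ] && continue       # already has the right name
        if [ -e "$dest" ]; then
            echo "'$dest' already exists, skipped '$f'" >&2
        else
            mv -- "$f" "$dest"
        fi
    done
)
```

With the example file from the question, rename_fasta_by_id /home/maryem/files would rename it to NC_013361.1.fasta.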

create and rename multiple copies of files

I have a file input.txt that looks as follows.
abas_1.txt
abas_2.txt
abas_3.txt
1fgh.txt
3ghl_1.txt
3ghl_2.txt
I have a folder ff. The filenames of this folder are abas.txt, 1fgh.txt, 3ghl.txt. Based on the input file, I would like to create and rename the multiple copies in ff folder.
For example in the input file, abas has three copies. In the ff folder, I need to create the three copies of abas.txt and rename it as abas_1.txt, abas_2.txt, abas_3.txt. No need to copy and rename 1fgh.txt in ff folder.
Your valuable suggestions would be appreciated.
You can try something like this (to be run from within your folder ff):
#!/bin/bash
while IFS= read -r fn; do
[[ $fn =~ ^(.+)_[[:digit:]]+\.([^\.]+)$ ]] || continue
fn_orig=${BASH_REMATCH[1]}.${BASH_REMATCH[2]}
echo cp -nv -- "$fn_orig" "$fn"
done < input.txt
Remove the echo if you're happy with it.
If you don't want to run from within the folder ff, just replace the line
echo cp -nv -- "$fn_orig" "$fn"
with
echo cp -nv -- "ff/$fn_orig" "ff/$fn"
The -n option tells cp not to overwrite existing files, and the -v option makes it verbose. The -- tells cp that there are no more options beyond this point, so that it will not be confused if one of the file names starts with a hyphen.
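The mapping from a copy name such as abas_1.txt back to abas.txt can also be written with plain parameter expansion, which works in shells without [[ ... =~ ... ]]. This is a minimal sketch and not the answer's own method; the function name orig_name is hypothetical.

```shell
# Sketch: print the original file name for a numbered copy,
# e.g. abas_1.txt -> abas.txt. Fails for names such as 1fgh.txt
# that have no _<digits> suffix. orig_name is a hypothetical helper.
orig_name() (
    new=$1
    case $new in *.*) ;; *) exit 1 ;; esac          # needs an extension
    ext=${new##*.}
    stem=${new%.*}
    case $stem in *_*) ;; *) exit 1 ;; esac         # needs an underscore
    num=${stem##*_}
    case $num in ''|*[!0-9]*) exit 1 ;; esac        # suffix must be all digits
    printf '%s.%s\n' "${stem%_*}" "$ext"
)
```

In a read loop over input.txt it could be used as: if orig=$(orig_name "$fn"); then echo cp -nv -- "$orig" "$fn"; fi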
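Using for and grep (run from within the folder ff, with the list in input.txt):
#!/bin/bash
for i in *
do
# strip the extension and append "_" to get the prefix of the copies
x="${i%.*}_"
grep -- "^$x" input.txt | while IFS= read -r j
do
cp -n -- "$i" "$j"
done
done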
Try this one
#!/bin/bash
while read newFileName;do
#split the string by _ delimiter
arr=(${newFileName//_/ })
extension="${newFileName##*.}"
fileToCopy="${arr[0]}.$extension"
#check for empty : '1fgh.txt' case
if [ -n "${arr[1]}" ]; then
#check if file exists
if [ -f $fileToCopy ];then
echo "copying $fileToCopy -> $newFileName"
cp "$fileToCopy" "$newFileName"
#else
# echo "File $fileToCopy does not exist, so it can't be copied"
fi
fi
done
You can call your script like this:
cat input.txt | ./script.sh
If you could change the format of input.txt, I suggest you adjust it in order to make your task easier. If not, here is my solution:
#!/bin/bash
SRC_DIR=/path/to/ff
INPUT=/path/to/input.txt
BACKUP_DIR=/path/to/backup
for cand in `ls $SRC_DIR`; do
grep "^${cand%.*}_" $INPUT | while read new
do
cp -fv $SRC_DIR/$cand $BACKUP_DIR/$new
done
done

Parallelize nested for loop in GNU Parallel

I have a small bash script to OCR PDF files (slightly modified this script). The basic flow for each file is:
For each page in pdf FILE:
Convert page to TIFF image (ImageMagick)
OCR image (tesseract)
Cat results to text file
Script:
FILES=/home/tgr/OCR/input/*.pdf
for f in $FILES
do
FILENAME=$(basename "$f")
ENDPAGE=$(pdfinfo $f | grep "^Pages: *[0-9]\+$" | sed 's/.* //')
OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
RESOLUTION=1400
touch $OUTPUT
for i in `seq 1 $ENDPAGE`; do
convert -monochrome -density $RESOLUTION $f\[$(($i - 1 ))\] page.tif
echo processing file $f, page $i
tesseract page.tif tempoutput -l ces
cat tempoutput.txt >> $OUTPUT
done
rm tempoutput.txt
rm page.tif
done
Because of high resolution and fact that tesseract can utilize only one core, the process is extremely slow (takes approx. 3 minutes to convert one PDF file).
Because I have thousands of PDF files I think I can use parallel to use all 4 cores, but I don't get the concept how to use it. In examples I see:
Nested for-loops like this:
(for x in `cat xlist` ; do
for y in `cat ylist` ; do
do_something $x $y
done
done) | process_output
can be written like this:
parallel do_something {1} {2} :::: xlist ylist | process_output
Unfortunately I was not able to figure out how to apply this. How do I parallelize my script?
Since you have 1000s of PDF files it is probably enough simply to parallelize the processing of PDF-files and not parallelize the processing of the pages in a single file.
function convert_func {
f=$1
FILENAME=$(basename "$f")
ENDPAGE=$(pdfinfo $f | grep "^Pages: *[0-9]\+$" | sed 's/.* //')
OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
RESOLUTION=1400
touch $OUTPUT
for i in `seq 1 $ENDPAGE`; do
convert -monochrome -density $RESOLUTION $f\[$(($i - 1 ))\] $$.tif
echo processing file $f, page $i
tesseract $$.tif $$ -l ces
cat $$.txt >> $OUTPUT
done
rm $$.txt
rm $$.tif
}
export -f convert_func
parallel convert_func ::: /home/tgr/OCR/input/*.pdf
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial (man parallel_tutorial or http://www.gnu.org/software/parallel/parallel_tutorial.html). Your command line will love you for it.
Read the EXAMPLEs (LESS=+/EXAMPLE: man parallel).
You can have a script like this.
#!/bin/bash
function convert_func {
local FILE=$1 RESOLUTION=$2 PAGE_INDEX=$3 OUTPUT=$4
local TEMP0=$(exec mktemp --suffix ".00.$PAGE_INDEX.tif")
local TEMP1=$(exec mktemp --suffix ".01.$PAGE_INDEX")
echo convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP0" ## Just for debugging purposes.
convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP0"
echo "processing file $FILE, page $PAGE_INDEX" ## I think you mean to place this before the line above.
tesseract "$TEMP0" "$TEMP1" -l ces
cat "$TEMP1".txt >> "$OUTPUT" ## Lines may be mixed up from different processes here and a workaround may still be needed but it may no longer be necessary if outputs are small enough.
rm -f "$TEMP0" "$TEMP1" "$TEMP1.txt"
}
export -f convert_func
FILES=(/home/tgr/OCR/input/*.pdf)
for F in "${FILES[@]}"; do
FILENAME=${F##*/}
ENDPAGE=$(exec pdfinfo "$F" | grep '^Pages: *[0-9]\+$' | sed 's/.* //')
OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
RESOLUTION=1400
touch "$OUTPUT" ## This may no longer be necessary. Or probably you mean to truncate it instead e.g. : > "$OUTPUT"
for (( I = 1; I <= ENDPAGE; ++I )); do
printf "%s\xFF%s\xFF%s\xFF%s\x00" "$F" "$RESOLUTION" "$I" "$OUTPUT"
done | parallel -0 -C $'\xFF' -j 4 -- convert_func '{1}' '{2}' '{3}' '{4}'
done
It exports a function that parallel can import, quotes the arguments properly, and uses unique temporary files so that the pages can be processed in parallel.
Update: this version writes each page's output to its own temporary file first, then concatenates those files into the main output file.
#!/bin/bash
shopt -s nullglob
function convert_func {
local FILE=$1 RESOLUTION=$2 PAGE_INDEX=$3 OUTPUT=$4 TEMPLISTFILE=$5
local TEMP_TIF=$(exec mktemp --suffix ".01.$PAGE_INDEX.tif")
local TEMP_TXT_BASE=$(exec mktemp --suffix ".02.$PAGE_INDEX")
echo "processing file $FILE, page $PAGE_INDEX"
echo convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP_TIF" ## Just for debugging purposes.
convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP_TIF"
tesseract "$TEMP_TIF" "$TEMP_TXT_BASE" -l ces
echo "$PAGE_INDEX"$'\t'"${TEMP_TXT_BASE}.txt" >> "$TEMPLISTFILE"
rm -f "$TEMP_TIF"
}
export -f convert_func
FILES=(/home/tgr/OCR/input/*.pdf)
for F in "${FILES[@]}"; do
FILENAME=${F##*/}
ENDPAGE=$(exec pdfinfo "$F" | grep '^Pages: *[0-9]\+$' | sed 's/.* //')
BASENAME=${FILENAME%.*}
OUTPUT="/home/tgr/OCR/output/$BASENAME.txt"
RESOLUTION=1400
TEMPLISTFILE=$(exec mktemp --suffix ".00.$BASENAME")
: > "$TEMPLISTFILE"
for (( I = 1; I <= ENDPAGE; ++I )); do
printf "%s\xFF%s\xFF%s\xFF%s\x00" "$F" "$RESOLUTION" "$I" "$OUTPUT"
done | parallel -0 -C $'\xFF' -j 4 -- convert_func '{1}' '{2}' '{3}' '{4}' "$TEMPLISTFILE"
while IFS=$'\t' read -r __ FILE; do
cat "$FILE"
rm -f "$FILE"
done < <(exec sort -n "$TEMPLISTFILE") > "$OUTPUT"
rm -f "$TEMPLISTFILE"
done
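The final step of this approach, putting the per-page text back together in page order, can be tried on its own without parallel or tesseract. The sketch below is a simplified illustration, not the script's actual mechanism: it assumes each page was written to a file named page-<N>.txt in one work directory (a hypothetical naming scheme, where the script above records mktemp names in a list file instead), and that paths contain no tabs or newlines.

```shell
# Sketch: concatenate page-<N>.txt files from a work directory in numeric
# page order. concat_pages and the page-<N>.txt naming are hypothetical.
concat_pages() (
    workdir=$1
    output=$2
    tab=$(printf '\t')
    : > "$output"                       # start with an empty output file
    for f in "$workdir"/page-*.txt; do
        [ -e "$f" ] || continue
        n=${f##*/page-}                 # strip directory and "page-"
        n=${n%.txt}                     # strip the extension, leaving N
        printf '%s\t%s\n' "$n" "$f"
    done | sort -n | while IFS=$tab read -r idx f; do
        cat -- "$f" >> "$output"        # pages arrive here sorted by number
    done
)
```

Sorting on the page number rather than on the file name matters because page-10.txt would otherwise come before page-2.txt.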

Resources