for loop with standard output - for-loop

I want to run two scripts (gunzip and fastx_collapser) for multiple input files. The output from gunzip should be input to fastx_collapser. How do I do this in a loop function?
My try:
for f in *.gz gunzip -c "$f" | fastx_collapser -Q33 -z -o "${f%}.coll.gz"

You need a do and a done (and some semicolons if you insist on a single line):
for f in *.gz
do
gunzip -c "$f" | fastx_collapser -Q33 -z -o "${f%}.coll.gz"
done
or:
for f in *.gz; do gunzip -c "$f" | fastx_collapser -Q33 -z -o "${f%}.coll.gz"; done
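One detail to watch (an aside, not part of the answer above): ${f%} with an empty pattern strips nothing, so the output names end up as file.gz.coll.gz. If the intent is to drop the .gz suffix first, a minimal sketch:
for f in *.gz
do
    # ${f%.gz} removes the trailing .gz before appending .coll.gz
    gunzip -c "$f" | fastx_collapser -Q33 -z -o "${f%.gz}.coll.gz"
done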

Related

bash iterate over a directory sorted by file size

As a webmaster, I generate a lot of junk code files. Periodically I have to purge the unneeded files, filtered by extension. Example: "cleaner txt". Easy enough. But I want to sort the files by size before processing them in the "for" loop. How can I do that?
cleaner:
#!/bin/bash
if [ -z "$1" ]; then
echo "Please supply the filename suffixes to delete.";
exit;
fi;
filter=$1;
for FILE in *.$filter; do clear;
cat $FILE; printf '\n\n'; rm -i $FILE; done
You can use a mix of find (to print file sizes and names), sort (to sort the output of find) and cut (to remove the sizes). In case you have very unusual file names containing any possible character including newlines, it is safer to separate the files by a character that cannot be part of a name: NUL.
#!/bin/bash
if [ -z "$1" ]; then
echo "Please supply the filename suffixes to delete.";
exit;
fi;
filter=$1;
while IFS= read -r -d '' -u 3 FILE; do
clear
cat "$FILE"
printf '\n\n'
rm -i "$FILE"
done 3< <(find . -mindepth 1 -maxdepth 1 -type f -name "*.$filter" \
-printf '%s\t%p\0' | sort -zn | cut -zf 2-)
Note that we must use a file descriptor other than stdin (3 in this example) to pass the file names to the loop. Otherwise, if we used stdin, it would also be used to provide the answers to rm -i.
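A minimal sketch of the alternative (my assumption, not part of the answer above): keep the loop reading from stdin and point rm -i at the terminal explicitly instead:
while IFS= read -r -d '' FILE; do
    clear
    cat "$FILE"
    printf '\n\n'
    rm -i "$FILE" < /dev/tty    # read the y/n answer from the terminal, not from the loop's stdin
done < <(find . -mindepth 1 -maxdepth 1 -type f -name "*.$filter" \
    -printf '%s\t%p\0' | sort -zn | cut -zf 2-)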
Inspired by this answer, you could use the find command as follows:
find ./ -type f -name "*.yaml" -printf "%s %p\n" | sort -n
The find command prints the size and the path of each file, so that the sort command orders the results from smallest to largest.
In case you want to iterate through (let's say) the 5 biggest files, you can do something like this using the tail command:
for f in $(find ./ -type f -name "*.yaml" -printf "%s %p\n" |
           sort -n |
           tail -n 5 |
           cut -d ' ' -f 2)
do
echo "### $f"
done
If the file names don't contain newlines or spaces:
while read filesize filename; do
printf "%-25s has size %10d\n" "$filename" "$filesize"
done < <(du -bs *."$filter"|sort -n)
while read filename; do
echo "$filename"
done < <(du -bs *."$filter"|sort -n|awk '{$0=$2}1')

find and grep / zgrep / lzgrep progress bar

I would like to add a progress bar to this command line:
find . \( -iname "*.bz" -o -iname "*.zip" -o -iname "*.gz" -o -iname "*.rar" \) -print0 | while read -d '' file; do echo "$file"; lzgrep -a stringtosearch\.anything "$file"; done
The progress bar should be calculated on the total size of the compressed files (not on each single file).
Of course, it can be a script too.
I would also like to add other progress bars, if possible:
The total number of files processed (example 3 out of 21)
The percentage of progress of the single file
Can anybody help me please?
Here is an example of what it should look like (example from here):
tar cf - /folder-with-big-files -P | pv -s $(du -sb /folder-with-big-files | awk '{print $1}') | gzip > big-files.tar.gz
Multiple progress bars (example from here):
pv -cN orig < foo.tar.bz2 | bzcat | pv -cN bzcat | gzip -9 | pv -cN gzip > foo.tar.gz
Thanks,
This is the first time I've ever heard of pv and it's not on any machine I have access to but assuming it needs to know a total at startup and then a number on each iteration of a command, you could do something like this to get a progress bar per file processed:
IFS= readarray -d '' files < <(find . -whatever -print0)
printf '%s\n' "${files[#]}" | pv -s "${#files[#]}" | command
The first line gives you an array of files so you can then use "${#files[@]}" to provide pv its initial total value (looks like you use -s value for that?) and then do whatever you normally do to get progress as each file is processed.
I don't see any way to tell pv that the pipe it's reading from is NUL-terminated rather than newline-terminated so if your files can have newlines in their names then you'd have to figure out how to solve that problem.
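If counting files rather than bytes is acceptable, here is a hedged sketch using pv's line mode (-l is a standard pv option; the rest of the pipeline is my assumption and, like the answer above, it assumes names without newlines):
IFS= readarray -d '' files < <(find . \( -iname "*.bz" -o -iname "*.zip" -o -iname "*.gz" -o -iname "*.rar" \) -print0)
printf '%s\n' "${files[@]}" |
    pv -l -s "${#files[@]}" |      # -l: count lines (one per file) instead of bytes
    while IFS= read -r file; do
        lzgrep -a stringtosearch\.anything "$file" || true
    done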
To additionally get progress on a single file you might need something like:
IFS= readarray -d '' files < <(find . -whatever -print0)
printf '%s\n' "${files[#]}" |
pv -s "${#files[#]}" |
xargs -n 1 -I {} sh -c 'pv {} | command'
I don't have pv so all of the above is untested so check the syntax, especially since I've never heard of pv :-).
Thanks to Max C., I found a solution for the main question:
find ./ -type f \( -iname '*.gz' -o -iname '*.bz' \) | (tot=0; while read fname; do s=$(stat -c%s "$fname"); if [ ! -z "$s" ]; then echo "$fname"; tot=$(($tot+$s)); fi; done; echo $tot) | tac | (read size; xargs -i{} cat "{}" | pv -s $size | lzgrep -a something -)
But this works only for gz and bz files; now I have to extend it to use a different tool according to the extension, as in the sketch below.
I'm going to try Ed's solution too.
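A hedged sketch of that per-extension dispatch (a plain case statement; it assumes zgrep, bzgrep and lzgrep are all installed):
case "$fname" in
    *.gz)          zgrep  -a "$pattern" "$fname" ;;   # gzip archives
    *.bz|*.bz2)    bzgrep -a "$pattern" "$fname" ;;   # bzip2 archives
    *.lzma|*.xz)   lzgrep -a "$pattern" "$fname" ;;   # lzma/xz archives
    *)             echo "skipping $fname: unknown extension" >&2 ;;
esac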
Thanks to Ed and Max C., here is version 0.2.
This version works with zgrep, but not with lzgrep. :-\
#!/bin/bash
echo -n "collecting dump... "
IFS= readarray -d '' files < <(find . \( -iname "*.bz" -o -iname "*.gz" \) -print0)
echo done
echo "Calculating archives size..."
tot=0
for line in "${files[#]}"; do
s=$(stat -c\%s "$line")
if [ ! -z "$s" ]
then
tot=$(($tot+$s))
fi
done
(for line in "${files[#]}"; do
s=$(stat -c\%s "$line")
if [ ! -z "$s" ]
then
echo "$line"
fi
done
) | xargs -i{} sh -c 'echo Processing file: "{}" 1>&2 ; cat "{}"' | pv -s $tot | zgrep -a anything -

Bash script to String concat two variables and do File compare

What I am trying to achieve is to delete files with the same name (filename + modified timestamp) existing in both Src_Dir1 and Src_Dir2.
So first I have tried to dump all the filenames to temp_a (Src_Dir1) and temp_b (Src_Dir2) respectively.
Below is a screenshot of the source directory.
The files inside Archive look like this, and there are a few files outside too.
So, initially I want to deal with the files inside Archive (Src_Dir1) and later with the ones outside it (Src_Dir2). What I am trying to do is use a while loop to read each filename, concatenate it with its modified timestamp (mtime), and write the result to temp_c. For example, AirTimeActs_2018-12-03.csv with mtime 2019-01-24 14:41:53.000000000 -0500 should produce AirTimeActs_2018-12-03.csv_2019-01-24 14:41:53.000000000 -0500, and temp_c should get one such line for every filename inside Archive (Src_Dir1). This is where I am stuck, on the string-concatenation step. Please help me with the code; I hope I am comprehensible.
IMPORTANT
(I would really appreciate help with the extension of the code that I haven't shown here and have yet to achieve: implement the same logic I am trying for temp_a for temp_b too, naming the result temp_d, and then compare the data in temp_c with temp_d. If there is a matching filename, delete the file existing in Src_Dir2; if there is no match, do nothing.)
#!/bin/bash
Src_Dir1=path/Airtime_Activation/Archive
Src_Dir2=path/Airtime_Activation/
find "$Src_Dir1" -maxdepth 1 -name "*.xlsx" -o -name "*.csv" | sed "s/.*\///" > -print>path/Airtime_Activation/temp_a
find "$Src_Dir2" -maxdepth 1 -name "*.xlsx" -o -name "*.csv" | sed "s/.*\///" > -print>path/Airtime_Activation/temp_b
echo 'phase1'
cat path/Airtime_Activation/temp_a | while read file;
do
echo 'phase1.5'
echo "$file"
echo 'phase2'
mtime=$(stat -c '%y' "$Src_Dir1/$file")
Full_name=${file}_${mtime}
echo "$Full_name" >> path/Airtime_Activation/temp_c
echo 'phase3'
done
#!/bin/bash
Src_Dir1=path/Airtime_Activation/Archive
Src_Dir2=path/Airtime_Activation/
find "$Src_Dir1" -maxdepth 1 -name "*.xlsx" -o -name "*.csv" | sed "s/.*\///" > -print>path/Airtime_Activation/temp_a
find "$Src_Dir2" -maxdepth 1 -name "*.xlsx" -o -name "*.csv" | sed "s/.*\///" > -print>path/Airtime_Activation/temp_b
echo 'phase1'
cat path/Airtime_Activation/temp_a | while read file;
do
echo 'phase1.5'
echo "$file"
echo 'phase2'
mtime=$(stat -c '%y' "$Src_Dir1/$file")
Full_name=${file}_${mtime}
echo "$Full_name" >> path/Airtime_Activation/temp_c
echo 'phase3'
done
cat path/Airtime_Activation/temp_b | while read file
#while IFS="" read -r -d $'\0' file;
do
#echo "$file"
echo 'phase2'
mtime=$(stat -c '%y' "$Src_Dir2/$file")
Full_name=${file}_${mtime}
echo "$Full_name" >> path/temp_d
echo 'phase3'
done
#file compare and delete old files from outside archive
grep -Ff path/Airtime_Activation/temp_d path/Airtime_Activation/temp_c > path/Airtime_Activation/temp_e
cat path/Airtime_Activation/temp_e | while read file
#while IFS="" read -r -d $'\0' file;
do
#echo "$file"
echo 'phase2'
echo "${file%_*}"
rm "$Src_Dir2/${file%_*}"
echo 'phase3'
done
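A hedged alternative for the compare-and-delete step (my sketch, not part of the script above): comm needs sorted input and prints only the lines common to both lists, which are then mapped back to filenames and removed.
sort path/Airtime_Activation/temp_c -o path/Airtime_Activation/temp_c
sort path/Airtime_Activation/temp_d -o path/Airtime_Activation/temp_d
comm -12 path/Airtime_Activation/temp_c path/Airtime_Activation/temp_d |
while IFS= read -r entry; do
    rm "$Src_Dir2/${entry%_*}"    # strip the trailing "_mtime" to recover the filename
done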

Is there a way to optimize this code and make it faster, or any other better solution?

I am looking to collect YANG models from my project's .jar files. I came up with an approach, but it takes time and my colleagues are not happy.
#!/bin/sh
set -e
# FIXME: make this tuneable
OUTPUT="yang models"
INPUT="."
JARS=`find $INPUT/system/org/linters -type f -name '*.jar' | sort -u`
# FIXME: also wipe output?
[ -d "$OUTPUT" ] || mkdir "$OUTPUT"
for jar in $JARS; do
artifact=`basename $jar | sed 's/.jar$//'`
echo "Extracting modules from $artifact"
# FIXME: better control over unzip errors
unzip -q "$jar" 'META-INF/yang/*' -d "$artifact" \
2>/dev/null || true
dir="$artifact/META-INF/yang"
if [ -d "$dir" ]; then
for file in `find $dir -type f -name '*.yang'`; do
module=`basename "$file"`
echo -e "\t$module"
# FIXME: better duplicate detection
mv -n "$file" "$OUTPUT"
done
fi
rm -rf "$artifact"
done
If the .jar files don't all change between invocations of your script then you could make the script significantly faster by caching the .jar files and only operating on the ones that changed, e.g.:
#!/usr/bin/env bash
set -e
# FIXME: make this tuneable
output='yang models'
input='.'
cache='/some/where'
mkdir -p "$cache" || exit 1
readarray -d '' jars < <(find "$input/system/org/linters" -type f -name '*.jar' -print0 | sort -zu)
# FIXME: also wipe output?
mkdir -p "$output" || exit 1
for jarpath in "${jars[@]}"; do
diff -q "$jarpath" "$cache" >/dev/null 2>&1 && continue  # skip jars identical to the cached copy
cp "$jarpath" "$cache"
jarfile="${jarpath##*/}"
artifact="${jarfile%.*}"
printf 'Extracting modules from %s\n' "$artifact"
# FIXME: better control over unzip errors
unzip -q "$jarpath" 'META-INF/yang/*' -d "$artifact" 2>/dev/null
dir="$artifact/META-INF/yang"
if [ -d "$dir" ]; then
readarray -d '' yangs < <(find "$dir" -type f -name '*.yang' -print0)
for yangpath in "${yangs[@]}"; do
yangfile="${yangpath##*/}"
printf '\t%s\n' "$yangfile"
# FIXME: better duplicate detection
mv -n "$yangpath" "$output"
done
fi
rm -rf "$artifact"
done
See Correct Bash and shell script variable capitalization, http://mywiki.wooledge.org/BashFAQ/082, https://mywiki.wooledge.org/Quotes, How can I store the "find" command results as an array in Bash for some of the other changes I made above.
I assume you have some reason for looping over the .yang files and skipping any whose name already exists, rather than just unzipping the .jar file straight into the final output directory.
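For reference, a hedged sketch of that simpler alternative (it relies on unzip's -j to junk paths and -n to never overwrite, both standard Info-ZIP options, so a duplicate name is simply kept from the first jar that provided it):
# extract every module flat into "$output", silently skipping names that already exist there
unzip -qnj "$jarpath" 'META-INF/yang/*' -d "$output" 2>/dev/null || true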

Parallelize nested for loop in GNU Parallel

I have a small bash script to OCR PDF files (slightly modified this script). The basic flow for each file is:
For each page in pdf FILE:
Convert page to TIFF image (ImageMagick)
OCR image (tesseract)
Cat results to text file
Script:
FILES=/home/tgr/OCR/input/*.pdf
for f in $FILES
do
FILENAME=$(basename "$f")
ENDPAGE=$(pdfinfo $f | grep "^Pages: *[0-9]\+$" | sed 's/.* //')
OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
RESOLUTION=1400
touch $OUTPUT
for i in `seq 1 $ENDPAGE`; do
convert -monochrome -density $RESOLUTION $f\[$(($i - 1 ))\] page.tif
echo processing file $f, page $i
tesseract page.tif tempoutput -l ces
cat tempoutput.txt >> $OUTPUT
done
rm tempoutput.txt
rm page.tif
done
Because of the high resolution and the fact that tesseract can utilize only one core, the process is extremely slow (it takes approx. 3 minutes to convert one PDF file).
Because I have thousands of PDF files, I think I can use parallel to use all 4 cores, but I don't get the concept of how to use it. In the examples I see:
Nested for-loops like this:
(for x in `cat xlist` ; do
for y in `cat ylist` ; do
do_something $x $y
done
done) | process_output
can be written like this:
parallel do_something {1} {2} :::: xlist ylist | process_output
Unfortunately I was not able to figure out how to apply this. How do I parallelize my script?
Since you have 1000s of PDF files it is probably enough simply to parallelize the processing of PDF-files and not parallelize the processing of the pages in a single file.
function convert_func {
f=$1
FILENAME=$(basename "$f")
ENDPAGE=$(pdfinfo $f | grep "^Pages: *[0-9]\+$" | sed 's/.* //')
OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
RESOLUTION=1400
touch $OUTPUT
for i in `seq 1 $ENDPAGE`; do
convert -monochrome -density $RESOLUTION $f\[$(($i - 1 ))\] $$.tif
echo processing file $f, page $i
tesseract $$.tif $$ -l ces
cat $$.txt >> $OUTPUT
done
rm $$.txt
rm $$.tif
}
export -f convert_func
parallel convert_func ::: /home/tgr/OCR/input/*.pdf
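A usage note (my addition, not part of the answer): parallel defaults to one job per CPU core, so on a 4-core machine the line above already runs 4 PDFs at a time; to cap the job count explicitly you could write:
parallel -j 4 convert_func ::: /home/tgr/OCR/input/*.pdf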
Watch the intro video for a quick introduction:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial (man parallel_tutorial or http://www.gnu.org/software/parallel/parallel_tutorial.html). Your command line will love you for it.
Read the EXAMPLEs (LESS=+/EXAMPLE: man parallel).
You can have a script like this.
#!/bin/bash
function convert_func {
local FILE=$1 RESOLUTION=$2 PAGE_INDEX=$3 OUTPUT=$4
local TEMP0=$(exec mktemp --suffix ".00.$PAGE_INDEX.tif")
local TEMP1=$(exec mktemp --suffix ".01.$PAGE_INDEX")
echo convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP0" ## Just for debugging purposes.
convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP0"
echo "processing file $FILE, page $PAGE_INDEX" ## I think you mean to place this before the line above.
tesseract "$TEMP0" "$TEMP1" -l ces
cat "$TEMP1".txt >> "$OUTPUT" ## Lines may be mixed up from different processes here and a workaround may still be needed but it may no longer be necessary if outputs are small enough.
rm -f "$TEMP0" "$TEMP1"
}
export -f convert_func
FILES=(/home/tgr/OCR/input/*.pdf)
for F in "${FILES[#]}"; do
FILENAME=${F##*/}
ENDPAGE=$(exec pdfinfo "$F" | grep '^Pages: *[0-9]\+$' | sed 's/.* //')
OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
RESOLUTION=1400
touch "$OUTPUT" ## This may no longer be necessary. Or probably you mean to truncate it instead e.g. : > "$OUTPUT"
for (( I = 1; I <= ENDPAGE; ++I )); do
printf "%s\xFF%s\xFF%s\xFF%s\x00" "$F" "$RESOLUTION" "$I" "$OUTPUT"
done | parallel -0 -C $'\xFF' -j 4 -- convert_func '{1}' '{2}' '{3}' '{4}'
done
It exports a function that can be imported by parallel, properly sanitizes the arguments, and uses unique temporary files to make parallel processing possible.
Update: this version holds the output in multiple temporary files first, before concatenating them into one main output file.
#!/bin/bash
shopt -s nullglob
function convert_func {
local FILE=$1 RESOLUTION=$2 PAGE_INDEX=$3 OUTPUT=$4 TEMPLISTFILE=$5
local TEMP_TIF=$(exec mktemp --suffix ".01.$PAGE_INDEX.tif")
local TEMP_TXT_BASE=$(exec mktemp --suffix ".02.$PAGE_INDEX")
echo "processing file $FILE, page $PAGE_INDEX"
echo convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP_TIF" ## Just for debugging purposes.
convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP_TIF"
tesseract "$TEMP_TIF" "$TEMP_TXT_BASE" -l ces
echo "$PAGE_INDEX"$'\t'"${TEMP_TXT_BASE}.txt" >> "$TEMPLISTFILE"
rm -f "$TEMP_TIF"
}
export -f convert_func
FILES=(/home/tgr/OCR/input/*.pdf)
for F in "${FILES[#]}"; do
FILENAME=${F##*/}
ENDPAGE=$(exec pdfinfo "$F" | grep '^Pages: *[0-9]\+$' | sed 's/.* //')
BASENAME=${FILENAME%.*}
OUTPUT="/home/tgr/OCR/output/$BASENAME.txt"
RESOLUTION=1400
TEMPLISTFILE=$(exec mktemp --suffix ".00.$BASENAME")
: > "$TEMPLISTFILE"
for (( I = 1; I <= ENDPAGE; ++I )); do
printf "%s\xFF%s\xFF%s\xFF%s\x00" "$F" "$RESOLUTION" "$I" "$OUTPUT"
done | parallel -0 -C $'\xFF' -j 4 -- convert_func '{1}' '{2}' '{3}' '{4}' "$TEMPLISTFILE"
while IFS=$'\t' read -r __ FILE; do
cat "$FILE"
rm -f "$FILE"
done < <(exec sort -n "$TEMPLISTFILE") > "$OUTPUT"
rm -f "$TEMPLISTFILE"
done
