bash iterate over a directory sorted by file size

As a webmaster, I generate a lot of junk files of code. Periodically I have to purge the unneeded files, filtered by extension. Example: "cleaner txt". Easy enough. But I want to sort the files by size and process them in the "for" loop. How can I do that?
cleaner:
#!/bin/bash
if [ -z "$1" ]; then
echo "Please supply the filename suffixes to delete.";
exit;
fi;
filter=$1;
for FILE in *."$filter"; do clear;
cat "$FILE"; printf '\n\n'; rm -i "$FILE"; done

You can use a mix of find (to print file sizes and names), sort (to sort the output of find) and cut (to remove the sizes). In case you have very unusual file names containing any possible character including newlines, it is safer to separate the files by a character that cannot be part of a name: NUL.
#!/bin/bash
if [ -z "$1" ]; then
echo "Please supply the filename suffixes to delete.";
exit;
fi;
filter=$1;
while IFS= read -r -d '' -u 3 FILE; do
clear
cat "$FILE"
printf '\n\n'
rm -i "$FILE"
done 3< <(find . -mindepth 1 -maxdepth 1 -type f -name "*.$filter" \
-printf '%s\t%p\0' | sort -zn | cut -zf 2-)
Note that we must use a file descriptor other than stdin (3 in this example) to pass the file names to the loop. Otherwise, if we used stdin, it would also be consumed to provide the answers to rm -i.
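As a minimal, self-contained sketch of the same pattern (the temporary directory and file names here are invented for the demo; GNU find, sort, and cut are assumed), the size-sorted NUL-delimited stream can be consumed on descriptor 3 like this:

```shell
#!/usr/bin/env bash
# Demo of the fd-3 pattern: names arrive on descriptor 3, so stdin stays
# free for interactive prompts such as rm -i.
tmp=$(mktemp -d)
printf 'x'   > "$tmp/small.txt"   # 1 byte
printf 'xxx' > "$tmp/large.txt"   # 3 bytes

result=()
while IFS= read -r -d '' -u 3 FILE; do
    result+=("$(basename "$FILE")")
done 3< <(find "$tmp" -mindepth 1 -maxdepth 1 -type f -name '*.txt' \
    -printf '%s\t%p\0' | sort -zn | cut -zf 2-)

printf '%s\n' "${result[@]}"   # smallest first
rm -rf "$tmp"
```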

Inspired by this answer, you could use the find command as follows:
find ./ -type f -name "*.yaml" -printf "%s %p\n" | sort -n
The find command prints the size and the path of each file, so that the sort command orders the results from smallest to largest.
If you want to iterate over (let's say) the 5 largest files, you can add the tail command like this:
for f in $(find ./ -type f -name "*.yaml" -printf "%s %p\n" |
sort -n |
tail -n 5 |
cut -d ' ' -f 2)
do
echo "### $f"
done

If the file names don't contain newlines or spaces:
while read filesize filename; do
printf "%-25s has size %10d\n" "$filename" "$filesize"
done < <(du -bs *."$filter"|sort -n)
while read filename; do
echo "$filename"
done < <(du -bs *."$filter"|sort -n|awk '{$0=$2}1')
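A runnable sketch of the du-based loop above, assuming GNU du (-b reports apparent size in bytes) and names without spaces; the demo directory and .log files are invented for illustration:

```shell
#!/usr/bin/env bash
# Create two demo files of different sizes and list them smallest first.
tmp=$(mktemp -d)
printf '12345' > "$tmp/big.log"     # 5 bytes
printf '1'     > "$tmp/tiny.log"    # 1 byte
cd "$tmp" || exit 1

order=()
while read -r filesize filename; do
    order+=("$filename")
    printf '%-25s has size %10d\n' "$filename" "$filesize"
done < <(du -bs -- *.log | sort -n)

cd / && rm -rf "$tmp"
```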

Related

For Loop: Identify Filename Pairs, Input to For Loop

I am attempting to adapt a previously answered question for use in a for loop.
I have a folder containing multiple paired file names that need to be provided sequentially as input to a for loop.
Example Input
WT1_0min-SRR9929263_1.fastq
WT1_0min-SRR9929263_2.fastq
WT1_20min-SRR9929265_1.fastq
WT1_20min-SRR9929265_2.fastq
WT3_20min-SRR12062597_1.fastq
WT3_20min-SRR12062597_2.fastq
Paired file names can be identified with the answer from the previous question:
find . -name '*_1.fastq' -exec basename {} '_1.fastq' \; | xargs -n1 -I{} echo {}_1.fastq {}_2.fastq
I now want to adopt this for use in a for loop so that each output file can be independently piped to subsequent commands, and so that output file names can be appended.
Input files can be provided as a comma-separated list of files after the -1 and -2 flags respectively. So for this example, the bulk and undesired input would be:
-1 WT1_0min-SRR9929263_1.fastq,WT1_20min-SRR9929265_1.fastq,WT3_20min-SRR12062597_1.fastq
-2 WT1_0min-SRR9929263_2.fastq,WT1_20min-SRR9929265_2.fastq,WT3_20min-SRR12062597_2.fastq
However, I would like to run this as a for loop so that input files are provided sequentially:
Iteration #1
-1 WT1_0min-SRR9929263_1.fastq
-2 WT1_0min-SRR9929263_2.fastq
Iteration #2
-1 WT1_20min-SRR9929265_1.fastq
-2 WT1_20min-SRR9929265_2.fastq
Iteration #3
-1 WT3_20min-SRR12062597_1.fastq
-2 WT3_20min-SRR12062597_2.fastq
Below is an example of the for loop I would like to run, using the xargs code to pull filenames. It currently does not work. I assume I need to somehow save the paired filenames from the xargs code as a variable that can be referenced in the for loop?
find . -name '*_1.fastq' -exec basename {} '_1.fastq' \; | xargs -n1 -I{} echo {}_1.fastq {}_2.fastq
for file in *.fastq
do
bowtie2 -p 8 -x /path/genome \
1- {}_1.fastq \
2- {}_2.fastq \
"../path/${file%%.fastq}_UnMappedReads.fastq.gz" \
2> "../path/${file%%.fastq}_Bowtie2_log.txt" | samtools view -# 7 -b | samtools sort -# 7 -m 5G -o "../path/${file%%.fastq}_Mapped.bam"
done
The expected outputs for the example would be:
WT1_0min-SRR9929263_UnMappedReads.fastq.gz
WT1_20min-SRR9929265_UnMappedReads.fastq.gz
WT3_20min-SRR12062597_UnMappedReads.fastq.gz
WT1_0min-SRR9929263_Bowtie2_log.txt
WT1_20min-SRR9929265_Bowtie2_log.txt
WT3_20min-SRR12062597_Bowtie2_log.txt
WT1_0min-SRR9929263_Mapped.bam
WT1_20min-SRR9929265_Mapped.bam
WT3_20min-SRR12062597_Mapped.bam
I don't know what "bowtie2" or "samtools" are but best I can tell all you need is:
#!/usr/bin/env bash
for file1 in *_1.fastq; do
file2="${file1%_1.fastq}_2.fastq"
echo "$file1" "$file2"
done
Replace echo with whatever you want to do with each pair of files.
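For instance, the output prefix the question asks for (e.g. WT1_0min-SRR9929263) falls out of the same parameter expansion; this sketch only echoes the derived names rather than invoking bowtie2, and the fixture file names are borrowed from the question:

```shell
#!/usr/bin/env bash
# Demo fixture: one pair of fastq files following the question's naming scheme.
tmp=$(mktemp -d); cd "$tmp" || exit 1
touch WT1_0min-SRR9929263_1.fastq WT1_0min-SRR9929263_2.fastq

for file1 in *_1.fastq; do
    file2="${file1%_1.fastq}_2.fastq"
    prefix="${file1%_1.fastq}"          # shared base for all output names
    echo "-1 $file1 -2 $file2 -> ${prefix}_Mapped.bam"
done

cd / && rm -rf "$tmp"
```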
If you HAD to use find for some reason then it'd be:
#!/usr/bin/env bash
while IFS= read -r file1; do
file2="${file1%_1.fastq}_2.fastq"
echo "$file1" "$file2"
done < <(find . -type f -name '*_1.fastq' -print)
or if your file names can contain newlines then:
#!/usr/bin/env bash
while IFS= read -r -d $'\0' file1; do
file2="${file1%_1.fastq}_2.fastq"
echo "$file1" "$file2"
done < <(find . -type f -name '*_1.fastq' -print0)

How can I sort an array based on a non-integer substring in Bash?

I wrote a cleanup script to delete certain files. The files are stored in subfolders, and I collect them into an array with find, so the search is recursive. An array entry could look like this:
(path to file)
./2021_11_08_17_28_45_1733556/2021_11_12_04_15_51_1733556_0.jfr
As you can see, the filenames are timestamps. find sorts by the folder name only (./2021_11_08_17_28_45_1733556), but I need to sort all files, which can be in different folders, by the timestamp of the files only (the folders can be ignored completely), so that I can delete the oldest files first. Below is my script in its not-properly-working state; I need to add some sorting to fix my problems.
Any ideas?
#!/bin/bash
# handle -h (help)
if [[ "$1" == "-h" || "$1" == "" ]]; then
echo -e '-p [Pfad zum Zielordner] \n-f [Anzahl der Files welche noch im Ordner vorhanden sein sollen] \n-d [false um dryRun zu deaktivieren]'
exit 0
fi
# handle parameters
while getopts p:f:d: flag
do
case "${flag}" in
p) pathToFolder=${OPTARG};;
f) maxFiles=${OPTARG};;
d) dryRun=${OPTARG};;
*) echo -e '-p [Pfad zum Zielordner] \n-f [Anzahl der Files welche noch im Ordner vorhanden sein sollen] \n-d [false um dryRun zu deaktivieren]'
esac
done
if [[ -z $dryRun ]]; then
dryRun=true
fi
# fill array with .jfr files, sorted so that the oldest files get deleted first
fillarray() {
files=($(find -name "*.jfr" -type f))
totalFiles=${#files[@]}
}
# Return size of file
getfilesize() {
filesize=$(du -k "$1" | cut -f1)
}
count=0
checkfiles() {
# Check if File matches the maxFiles parameter
if [[ ${#files[@]} -gt $maxFiles ]]; then
# Check if dryRun is enabled
if [[ $dryRun == "false" ]]; then
echo "msg=\"Removal result\", result=true, file=$(realpath $1) filesize=$(getfilesize $1), reason=\"outside max file boundary\""
((count++))
rm $1
else
((count++))
echo msg="\"Removal result\", result=true, file=$(realpath $1 ) filesize=$(getfilesize $1), reason=\"outside max file boundary\""
fi
# Remove the file from the files array
files=(${files[@]/$1})
else
echo msg="\"Removal result\", result=false, file=$( realpath $1), reason=\"within max file boundary\""
fi
}
# Scan for empty files
scanfornullfiles() {
for file in "${files[@]}"
do
filesize=$(! getfilesize $file)
if [[ $filesize == 0 ]]; then
files=(${files[@]/$file})
echo msg="\"Removal result\", result=false, file=$(realpath $file), reason=\"empty file\""
fi
done
}
echo msg="jfrcleanup.sh started", maxFiles=$maxFiles, dryRun=$dryRun, directory=$pathToFolder
{
cd $pathToFolder > /dev/null 2>&1
} || {
echo msg="no permission in directory"
echo msg="jfrcleanup.sh stopped"
exit 0
}
fillarray #> /dev/null 2>&1
scanfornullfiles
for file in "${files[@]}"
do
checkfiles $file
done
echo msg="\"jfrcleanup.sh finished\", totalFileCount=$totalFiles filesRemoved=$count"
Assuming the file paths do not contain newline characters, would you please try
the following Schwartzian transform method:
#!/bin/bash
pat="/([0-9]{4}(_[0-9]{2}){5})[^/]*\.jfr$"
while IFS= read -r -d "" path; do
if [[ $path =~ $pat ]]; then
printf "%s\t%s\n" "${BASH_REMATCH[1]}" "$path"
fi
done < <(find . -type f -name "*.jfr" -print0) | sort -k1,1 | head -n 1 | cut -f2- | tr "\n" "\0" | xargs -0 echo rm
The string pat is a regex pattern to extract the timestamp from the
filename such as 2021_11_12_04_15_51.
Then the timestamp is prepended to the filename delimited by a tab
character.
The output lines are sorted by the timestamp in ascending order
(oldest first).
head -n 1 picks the oldest line. If you want to change the number of files
to remove, modify the number given to the -n option.
cut -f2- drops the timestamp to recover the filename.
tr "\n" "\0" protects filenames that contain whitespace or tab
characters.
xargs -0 echo rm just outputs the command lines as a dry run.
If the output looks good, drop echo.
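Here is the same decorate-sort-undecorate pipeline exercised on two throwaway files (the directory layout is invented for the demo), capturing the oldest path instead of removing it:

```shell
#!/usr/bin/env bash
# Two .jfr files whose *names* carry the timestamps; the folders differ.
tmp=$(mktemp -d)
mkdir -p "$tmp/d1" "$tmp/d2"
touch "$tmp/d2/2021_11_08_17_28_45_x.jfr" "$tmp/d1/2021_11_12_04_15_51_x.jfr"

pat="/([0-9]{4}(_[0-9]{2}){5})[^/]*\.jfr$"
oldest=$(
  cd "$tmp" &&
  while IFS= read -r -d '' path; do
    # prepend the extracted timestamp, tab-delimited, then sort on it
    [[ $path =~ $pat ]] && printf '%s\t%s\n' "${BASH_REMATCH[1]}" "$path"
  done < <(find . -type f -name '*.jfr' -print0) | sort -k1,1 | head -n 1 | cut -f2-
)
echo "$oldest"
rm -rf "$tmp"
```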
If you have GNU find, and pathnames don't contain new-line ('\n') and tab ('\t') characters, the output of this command will be ordered by basenames:
find path/to/dir -type f -printf '%f\t%p\n' | sort | cut -f2-
TL;DR, but since you're using find, and if it supports the -printf option, something like:
find . -type f -name '*.jfr' -printf '%f/%h/%f\n' | sort -k1 -n | cut -d '/' -f2-
Otherwise, a while read loop with another -printf option:
#!/usr/bin/env bash
while IFS='/' read -rd '' time file; do
printf '%s\n' "$file"
done < <(find . -type f -name '*.jfr' -printf '%T@/%p\0' | sort -zn)
Note that -printf from find and the -z flag of sort are GNU extensions.
To save the file names instead of printing them, change
printf '%s\n' "$file"
to
files+=("$file")
which appends each name to an array named files. Then "${files[@]}" holds the file names as elements.
This last while read loop does not depend on the file names at all, only on the timestamp reported by GNU find.
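A self-contained check of the %T@ variant, where GNU touch -d fabricates the modification times for the demo:

```shell
#!/usr/bin/env bash
# Sort by mtime, oldest first, regardless of what the files are called.
tmp=$(mktemp -d)
touch -d '2021-11-08' "$tmp/whatever.jfr"   # older mtime
touch -d '2021-11-12' "$tmp/other.jfr"

files=()
while IFS='/' read -rd '' ts file; do
    # ts gets the epoch timestamp; file keeps the rest of the record (the path)
    files+=("$file")
done < <(find "$tmp" -type f -name '*.jfr' -printf '%T@/%p\0' | sort -zn)

printf '%s\n' "${files[@]}"
rm -rf "$tmp"
```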
I solved the problem! I sort the array with the following so the oldest files will be deleted first:
files=($(printf '%s\n' "${files[@]}" | sort -t/ -k3))

rename files in a folder using find shell

I have n files in different folders, like abc.mp3, acc.mp3, bbb.mp3, and I want to rename them 01-abc.mp3, 02-acc.mp3, 03-bbb.mp3... I tried this:
#!/bin/bash
IFS='
'
COUNT=1
for file in ./uff/*;
do mv "$file" "${COUNT}-$file" let COUNT++ done
but I keep getting errors like syntax error near 'do' and sometimes command not found... Can someone provide a single-line solution for this using "find" from the terminal? I'm looking for a solution using find only, due to certain constraints... Thanks in advance.
I'd probably use:
#!/bin/bash
cd ./uff || exit 1
COUNT=1
for file in *.mp3;
do
mv "$file" $(printf "%.2d-%s" ${COUNT} "$file")
((COUNT++))
done
This avoids a number of issues and also includes a 2-digit number for the first 9 files (the next 90 get 2-digit numbers anyway, and after that you get 3-digit numbers, etc).
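The zero-padding itself comes from the precision in the %.2d format; a quick standalone check:

```shell
# printf pads integers to the given precision with leading zeros.
n1=$(printf '%.2d-%s' 1 abc.mp3)
n10=$(printf '%.2d-%s' 10 acc.mp3)
echo "$n1 $n10"
```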
You can try this:
#!/bin/bash
COUNT=1
for file in ./uff/*;
do
path=$(dirname "$file")
filename=$(basename "$file")
if [ $COUNT -lt 10 ]; then
mv "$file" "$path"/0"${COUNT}-$filename";
else
mv "$file" "$path"/"${COUNT}-$filename";
fi
COUNT=$(($COUNT+1));
done
Eg:
user@host:/tmp/test$ ls uff/
abc.mp3 acc.mp3 bbb.mp3
user@host:/tmp/test$ ./test.sh
user@host:/tmp/test$ ls uff/
01-abc.mp3 02-acc.mp3 03-bbb.mp3
Ok, here's the version without loops:
paste -d'\n' <(printf "%s\n" *) <(printf "%s\n" * | nl -w1 -s-) | xargs -d'\n' -n2 mv -v
You can also use find if you want:
paste -d'\n' <(find -mindepth 1 -maxdepth 1 -printf "%f\n") <(find -mindepth 1 -maxdepth 1 -printf "%f\n" | nl -w1 -s-) | xargs -d'\n' -n2 mv -v
Replace mv with echo mv for the "dry run":
paste -d'\n' <(printf "%s\n" *) <(printf "%s\n" * | nl -w1 -s-) | xargs -d'\n' -n2 echo mv -v
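Running the dry-run variant against a throwaway directory (file names borrowed from the earlier example) shows the generated commands:

```shell
#!/usr/bin/env bash
# paste interleaves the plain names with the numbered names line by line;
# xargs then consumes them two at a time. echo keeps this a dry run.
tmp=$(mktemp -d); cd "$tmp" || exit 1
touch abc.mp3 acc.mp3 bbb.mp3

out=$(paste -d'\n' <(printf "%s\n" *) <(printf "%s\n" * | nl -w1 -s-) \
      | xargs -d'\n' -n2 echo mv -v)
printf '%s\n' "$out"

cd / && rm -rf "$tmp"
```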
Here's a solution.
i=1
for f in $(find ./uff -mindepth 1 -maxdepth 1 -type f | sort)
do
n=$i
[ $i -lt 10 ] && n="0$i"
echo "$f" "$n-$(basename "$f")"
((i++))
done
And here it is as a one-liner (but in real life if you ever tried anything remotely like what's below in a coding or ops interview you'd not only fail to get the job, you'd probably give the interviewer PTSD. They'd wake up in cold sweats thinking about how terrible your solution was).
i=1; for f in $(find ./uff -mindepth 1 -maxdepth 1 -type f | sort); do n=$i; [ $i -lt 10 ] && n="0$i"; echo "$f" "$n-$(basename "$f")" ; ((i++)); done
Alternatively, you could just cd ./uff if you wanted the rename them in the same directory, and then use find . (along with the other find arguments) to clear everything up. I'm assuming you only want files moved, not directories. And I'm assuming you don't want to recursively rename files / directories.

renumbering image files to be contiguous in bash

I have a directory with image files that follow a naming scheme and are not always contiguous, e.g.:
IMG_33.jpg
IMG_34.jpg
IMG_35.jpg
IMG_223.jpg
IMG_224.jpg
IMG_225.jpg
IMG_226.jpg
IMG_446.jpg
I would like to rename them so they go something like this, in the same order:
0001.jpg
0002.jpg
0003.jpg
0004.jpg
0005.jpg
0006.jpg
0007.jpg
0008.jpg
So far this is what I came up with, and while it does the four-digit padding, it doesn't sort by the numeric values in the filenames.
#!/bin/bash
X=1;
for i in *; do
mv $i $(printf %04d.%s ${X%.*} ${i##*.})
let X="$X+1"
done
result:
IMG_1009.JPG 0009.JPG
IMG_1010.JPG 0010.JPG
IMG_101.JPG 0011.JPG
IMG_102.JPG 0012.JPG
Update:
Try this. If the output is okay, remove echo.
X=1; find . -maxdepth 1 -type f -name "*.jpg" -print0 | sort -z -n -t _ -k2 | while read -d $'\0' -r line; do echo mv "$line" "$(printf "%04d%s" $X .jpg)"; ((X++)); done
Using the super helpful rename: first, pad files with one digit to two digits; then pad files with two digits to three digits; and so on.
rename IMG_ IMG_0 IMG_?.jpg
rename IMG_ IMG_0 IMG_??.jpg
rename IMG_ IMG_0 IMG_???.jpg
Then your for loop (or a similar one) that renames does the trick, as the files are now in both alphabetical and numerical order.
How about this:
while read -r f1; do
echo "$f1"
mv "IMG_$f1" "$f1"
done < <(ls | cut -d '_' -f 2 | sort -n)

Is there a 'better' way to find a list of files in a directory tree

I have created a list of files using find, foundlist.lst.
The find command is simply find . -type f -name "<search_pattern>" > foundlist.lst
I would now like to use this list to find copies of these files in other directories.
The 'twist' in my requirements is that I want to search only for the 'base' of the file name. I don't want to include the extension in the search.
Example:
./sort.cc is a member of the list. I want to look for all files of the pattern sort.*
Here is what I wrote. It works. It seems to me that there is a more efficient way to do this.
./findfiles.sh foundfiles.lst /usr/bin/temp
#!/bin/bash
# findfiles.sh
if [ $# -ne 2 ]; then
echo "Need two arguments"
echo "usage: findfiles <filelist> <dir_to_search>"
else
filename=$1
echo "$filename"
while read -r line; do
name=$line
# change './file.ext' to 'file.*'
search_base=$( echo ${name} | sed "s%\.\/%%" | sed "s/\..*/\.\*/" )
find "$2" -type f -name "$search_base"
done < $filename
fi
For stripping the file name, I'd use the following (instead of sed):
search_base=`basename ${name} | cut -d'.' -f1`
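A quick check of that stripping, with .* appended to form the search pattern (the file name is invented for the demo):

```shell
# basename drops the leading ./ path; cut drops everything after the first dot.
name="./sort.cc"
search_base="$(basename "$name" | cut -d'.' -f1).*"
echo "$search_base"
```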
