I am looking for a bash script (or one-liner) to accomplish the following:
Check to see if there is more than one file containing the substring "slurm-"
If so, remove all of the files containing the substring except for the newest one
Any help would be greatly appreciated, thank you.
The following isn't exceptionally efficient with a very long list of files, but (1) it's fast with a short list (low constant startup cost), and (2) it's very explicit about how it operates (easy to read and understand).
shopt -s nullglob
candidates=( slurm-* )
(( ${#candidates[@]} < 2 )) && exit 0 ## nothing to do if <2 files exist
latest=${candidates[0]} ## populate latest variable w/ first
for candidate in "${candidates[@]}"; do ## loop through the whole set
if [[ $candidate -nt $latest ]]; then ## and if one is newer, call it "latest"
latest=$candidate
fi
done
for candidate in "${candidates[@]}"; do ## iterate through the whole set
if [[ $candidate != "$latest" ]]; then ## and for everything but the latest file
rm -f -- "$candidate" ## run a deletion
fi
done
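If the filenames are guaranteed to contain no spaces or newlines (typical for slurm-<jobid>.out files), a shorter but less robust sketch of the same idea, using GNU tools, is:
ls -t slurm-* 2>/dev/null | tail -n +2 | xargs -r rm --
Here ls -t lists newest first, tail -n +2 drops the newest entry, and xargs -r rm -- deletes the rest (doing nothing if the list is empty).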
Answering the XY problem: if your intent is to keep a clean working directory while submitting the same job several times in a row until it runs properly, you might find it a better course of action to add #SBATCH -o output.txt to your submission file, so that Slurm overwrites the same output file on every submission.
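For instance, a minimal submission file could start like this (the job name and command are placeholders):
#!/bin/bash
#SBATCH -J myjob         # placeholder job name
#SBATCH -o output.txt    # reuse the same output file on every submission
srun ./my_program        # placeholder command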
Related
I have several .jpg files in several folders, about 20K in total. The filenames are all different, like 123.jpg, abc.jpg, ab12.jpg. What I need is to rename all those files with a bash script using a leading-zero pattern.
I used the code below, but every time I add new files, the previously renamed files get renamed again. Could anyone help me out of this situation? It would be really helpful. I have searched the entire web for this and could not find a solution :(
#!/bin/bash
num=0
for i in *.jpg
do
a=`printf "%05d" $num`
mv "$i" "filename_$a.jpg"
let "num = $(($num + 1))"
done
To provide a concrete example of the problem, consider:
touch foo.jpg bar.jpg baz.jpg
The first time this script is run, bar.jpg is renamed to filename_00000.jpg; baz.jpg is renamed to filename_00001.jpg; foo.jpg is renamed to filename_00002.jpg. This behavior is acceptable.
If someone then runs:
touch a.jpg
...and runs the script again, then it renames a.jpg to filename_00000.jpg, renames filename_00000.jpg (now a.jpg, as the old version got overwritten!) to filename_00001.jpg, renames filename_00001.jpg to filename_00002.jpg, etc.
How can I make the program leave the files already matching filename_#####.jpg alone, and rename new files to have numbers after the last one that already exists?
#!/bin/bash
shopt -s extglob # enable extended globbing -- regex-like syntax
prefix="filename_"
# Find the largest-numbered file previously renamed
num=0 # initialize counter to 0
for f in "$prefix"+([[:digit:]]).jpg; do # Iterate only over names w/ prefix/suffix
f=${f#"$prefix"} # strip the prefix
f=${f%.jpg} # strip the suffix
if (( 10#$f > num )); then # force base-10 evaluation
num=$(( 10#$f ))
fi
done
# Second pass: Iterate over *all* names, and rename the ones that don't match the pattern
for i in *.jpg; do
[[ $i = "$prefix"+([[:digit:]]).jpg ]] && continue # Skip files already matching pattern
printf -v a '%05d' "$num" # More efficient than subshell use
until mv -n -- "$i" "$prefix$a.jpg"; do # "--" forces parse of "$i" as name
[[ -e "$i" ]] || break # abort if source file disappeared
num=$((num + 1)) # if we couldn't rename, increment num
printf -v a '%05d' "$num" # ...and try again with the next name
done
num=$((num + 1)) # modern POSIX math syntax
done
Note the use of mv -n to prevent overwrites -- that way two copies of this script running at the same time won't overwrite each other's files.
I have a question on how to approach a problem I've been trying to tackle at multiple points over the past month. The scenario is like so:
I have a base directory with multiple sub-directories, all following the same sub-directory format:
A/{B1,B2,B3} where all B* have a pipeline/results/ directory structure under them.
All of these results directories have multiple *.xyz files in them. These *.xyz files have a certain hierarchy based on their naming prefixes. The naming prefixes in turn depend on how far they've been processed. They could be, for example, select.xyz, select.copy.xyz, and select.copy.paste.xyz, where the operations are select, copy and paste. What I wish to do is write a ls | grep or a find that picks these files based on their processing levels.
EDIT:
The processing pipeline goes select -> copy -> paste. The "most processed" file is the one with the most of those stages as prefixes in its filename, i.e. select.copy.paste.xyz is more processed than select.copy.xyz, which in turn is more processed than select.xyz.
For example, let's say
B1/pipeline/results/ has select.xyz and select.copy.xyz,
B2/pipeline/results/ has select.xyz
B3/pipeline/results/ has select.xyz, select.copy.xyz, and select.copy.paste.xyz
How can I write a ls | grep/find that picks the most processed file from each subdirectory? This should give me B1/pipeline/results/select.copy.xyz, B2/pipeline/results/select.xyz and B3/pipeline/results/select.copy.paste.xyz.
Any pointer on how I can think about an approach would help. Thank you!
For this answer, we will ignore the upper part A/B{1,2,3} of the directory structure. All files in some .../pipeline/results/ directory will be considered, even if the directory is A/B1/doNotIncludeMe/forbidden/pipeline/results. We assume that the file extension xyz is constant.
A simple solution would be to loop over the directories and check whether the files exist from back to front. That is, check if select.copy.paste.xyz exists first. In case the file does not exist, check if select.copy.xyz exists and so on. A script for this could look like the following:
#! /bin/bash
# print paths of the most processed files
shopt -s globstar nullglob
for d in **/pipeline/results; do
if [ -f "$d/select.copy.paste.xyz" ]; then
echo "$d/select.copy.paste.xyz"
elif [ -f "$d/select.copy.xyz" ]; then
echo "$d/select.copy.xyz"
elif [ -f "$d/select.xyz" ]; then
echo "$d/select.xyz"
else
: # there is no file at all
fi
done
It does the job, but is not very nice. We can do better!
#! /bin/bash
# print paths of the most processed files
shopt -s globstar nullglob
for dir in **/pipeline/results; do
for file in "$dir"/select{.copy{.paste,},}.xyz; do
[ -f "$file" ] && echo "$file" && break
done
done
The second script does exactly the same thing as the first one, but is easier to maintain, adapt, and so on. Both scripts work with file and directory names that contain spaces or even newlines.
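For example, if a hypothetical fourth stage trim were added after paste, only the brace expression in the second script would need to change; brace expansion still lists the candidates from most to least processed:
for file in "$dir"/select{.copy{.paste{.trim,},},}.xyz; do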
In case you don't have whitespace in your paths, the following (hacky, but loop-free) script can also be used.
#! /bin/bash
# print paths of the most processed files
shopt -s globstar nullglob
files=(**/pipeline/results/select{.copy{.paste,},}.xyz)
printf '%s\n' "${files[@]}" | sed -r 's#(.*/)#\1 #' | sort -usk1,1 | tr -d ' '
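With the example layout from the question, all three scripts print (paths relative to the directory they are run from):
B1/pipeline/results/select.copy.xyz
B2/pipeline/results/select.xyz
B3/pipeline/results/select.copy.paste.xyz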
I am currently running a program that rearranges genomes to create the best alignment to a reference genome, and as it does so it generates a number of folders like alignment#.
I have no way of knowing how many iterations this program will run through before it stops, but the final alignment is the best one (could be anything from alignment5 to alignment35) and will have a predictable filename within the folder, though the folder will be changeable.
I need a bash script that will look inside a directory, identify the highest-numbered directory, and store it in a variable or similar, which could ideally be passed to an additional program.
I just wanted to add that my scripting is very basic. If you guys could explain your answers as thoroughly as possible or provide links to user-friendly resources that would be much appreciated.
A concept script here:
#!/bin/bash
shopt -s extglob || exit
DIR="/parent/dir" highest=
for a in "$DIR"/alignment+([[:digit:]]); do
b=${a##*/alignment}
[[ -z $highest || $b -gt $highest ]] && highest=$b
done
[[ -n $highest ]] && echo "Highest: $DIR/alignment${highest}"
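To hand the result to another program rather than just printing it, the last line could become something like this (additional_program is a placeholder for whatever you want to run):
[[ -n $highest ]] && additional_program "$DIR/alignment${highest}"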
ls -1 "$directory" | sort --numeric
This assumes a consistent prefix to the file names.
Otherwise, you can use "sort -k N --numeric", see "man sort" for details.
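Since the number is embedded in the directory name, GNU sort's version sort (-V) orders alignment5 before alignment35; a minimal sketch that captures the highest-numbered directory in a variable (assuming names contain no newlines):
highest_dir=$(printf '%s\n' "$directory"/alignment* | sort -V | tail -n 1)
echo "Highest: $highest_dir"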
My test equipment generates large text files which tend to grow in size over a period of several days as data is added.
But the text files are transferred to a PC for backup purposes daily, where they're compressed with gzip, even before they've finished growing.
This means I frequently have both file.txt and a compressed form file.txt.gz where the uncompressed file may be more up to date than the compressed version.
I decide which to keep with the following bash script gzandrm:
#!/usr/bin/bash
# Given an uncompressed file, look in the same directory for
# a gzipped version of the file and delete the uncompressed
# file if zdiff reveals they're identical. Otherwise, the
# file can be compressed.
# eg: find . -name '*.txt' -exec gzandrm {} \;
if [[ -e $1 && -e $1.gz ]]
then
# simple check: use zdiff and count the characters
DIFFS=$(zdiff "$1" "$1.gz" | wc -c)
if [[ $DIFFS -eq 0 ]]
then
# difference is '0', delete the uncompressed file
echo "'$1' already gzipped, so removed"
rm "$1"
else
# difference is non-zero, check manually
echo "'$1' and '$1.gz' are different"
fi
else
# go ahead and compress the file
echo "'$1' not yet gzipped, doing it now"
gzip "$1"
fi
and this has worked well, but it would make more sense to compare the files' modification dates: gzip does not change the modification date when it compresses, so two files with the same date are really the same file, even if one of them is compressed.
How can I modify my script to compare files by date, rather than size?
It's not entirely clear what the goal is, but it seems to be simple efficiency, so I think you should make two changes: 1) check modification times, as you suggest, and don't bother comparing content if the uncompressed file is no newer than the compressed file, and 2) use zcmp instead of zdiff.
Taking #2 first, your script does this:
DIFFS=$(zdiff "$1" "$1.gz" | wc -c)
if [[ $DIFFS -eq 0 ]]
which will perform a full diff of potentially large files, count the characters in diff's output, and examine the count. But all you really want to know is whether the content differs. cmp is better for that, since it will scan byte by byte and stop if it encounters a difference. It doesn't take the time to format a nice textual comparison (which you will mostly ignore); its exit status tells you the result. zcmp isn't quite as efficient as raw cmp, since it'll need to do an uncompress first, but zdiff has the same issue.
So you could switch to zcmp (and remove the use of a subshell, eliminate wc, not invoke [[, and avoid putting potentially large textual diff data into a variable) just by changing the above two lines to this:
if zcmp -s "$1" # if $1 and $1.gz are the same
To go a step further and check modification times first, you can use the -nt (newer than) option to the test command (also known as square bracket), rewriting the above line as this:
if [ ! "$1" -nt "$1.gz" ] || zcmp -s "$1"
which says that if the uncompressed version is no newer than the compressed version OR if they have the same content, then $1 is already gzipped and you can remove it. Note that if the uncompressed file is no newer, zcmp won't run at all, saving some cycles.
The rest of your script should work as is.
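Putting both changes together, the body of gzandrm becomes the following sketch (only the inner test line changes from your original):
#!/usr/bin/bash
if [[ -e $1 && -e $1.gz ]]
then
# no newer than the .gz, or identical content: safe to remove
if [ ! "$1" -nt "$1.gz" ] || zcmp -s "$1"
then
echo "'$1' already gzipped, so removed"
rm "$1"
else
echo "'$1' and '$1.gz' are different"
fi
else
echo "'$1' not yet gzipped, doing it now"
gzip "$1"
fi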
One caveat: modification times are very easy to change. Just moving the compressed file from one machine to another could change its modtime, so you'll have to consider your own case to know whether the modtime check is a valid optimization or more trouble than it's worth.
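If the transfer itself preserves timestamps, the check stays meaningful; for example (backuphost and the destination path are placeholders):
rsync -t file.txt.gz backuphost:/backups/    # -t preserves modification times
scp -p file.txt.gz backuphost:/backups/      # -p preserves times and modes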
You can get an easy-to-compare date stamp of a file using stat with the %Y or %Z format string, which give the time of last modification or last change, respectively, in seconds since the epoch.
if [ "$(stat -c %Z "$1")" -eq "$(stat -c %Z "$1.gz")" ]; then
echo "Last changed time of $1 is the same as $1.gz"
fi
I have a web server that saves the logs files of a web application numbered. A file name example for this would be:
dbsclog01s001.log
dbsclog01s002.log
dbsclog01s003.log
The last 3 digits are a counter, and it can sometimes go up to 100.
I usually open a web browser, browse to the file like:
http://someaddress.com/logs/dbsclog01s001.log
and save the files. This of course gets a bit annoying when you get 50 logs.
I tried to come up with a BASH script for using wget and passing
http://someaddress.com/logs/dbsclog01s*.log
but I am having problems with my script.
Anyway, does anyone have a sample of how to do this?
thanks!
#!/bin/sh
if [ $# -lt 3 ]; then
echo "Usage: $0 url_format seq_start seq_end [wget_args]"
exit 1
fi
url_format=$1
seq_start=$2
seq_end=$3
shift 3
printf "$url_format\\n" `seq $seq_start $seq_end` | wget -i- "$@"
Save the above as seq_wget, give it execution permission (chmod +x seq_wget), and then run, for example:
$ ./seq_wget http://someaddress.com/logs/dbsclog01s%03d.log 1 50
Or, if you have Bash 4.0, you could just type
$ wget http://someaddress.com/logs/dbsclog01s{001..050}.log
Or, if you have curl instead of wget, you could follow Dennis Williamson's answer.
curl seems to support ranges. From the man page:
URL
The URL syntax is protocol dependent. You’ll find a detailed description in RFC 3986.
You can specify multiple URLs or parts of URLs by writing part sets
within braces as in:
http://site.{one,two,three}.com
or you can get sequences of alphanumeric series by using [] as in:
ftp://ftp.numericals.com/file[1-100].txt
ftp://ftp.numericals.com/file[001-100].txt (with leading zeros)
ftp://ftp.letters.com/file[a-z].txt
No nesting of the sequences is supported at the moment, but you can use
several ones next to each other:
http://any.org/archive[1996-1999]/vol[1-4]/part{a,b,c}.html
You can specify any amount of URLs on the command line. They will be
fetched in a sequential manner in the specified order.
Since curl 7.15.1 you can also specify step counter for the ranges, so
that you can get every Nth number or letter:
http://www.numericals.com/file[1-100:10].txt
http://www.letters.com/file[a-z:2].txt
You may have noticed that it says "with leading zeros"!
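Applied to the URLs in the question, that would be something like this sketch (adjust the range to however many logs actually exist):
curl -O "http://someaddress.com/logs/dbsclog01s[001-100].log"
The -O flag saves each file under its remote name.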
You can use echo-style brace sequences in the wget URL to download a run of numbered files...
wget http://someaddress.com/logs/dbsclog01s00{1..3}.log
This also works with letters
{a..z} {A..Z}
Not sure precisely what problems you were experiencing, but it sounds like a simple for loop in bash would do it for you.
for i in {1..999}; do
wget -k http://someaddress.com/logs/dbsclog01s$i.log -O your_local_output_dir_$i;
done
You can use a combination of a for loop in bash with the printf command (of course modifying echo to wget as needed):
$ for i in {1..10}; do echo "http://www.com/myurl`printf "%03d" $i`.html"; done
http://www.com/myurl001.html
http://www.com/myurl002.html
http://www.com/myurl003.html
http://www.com/myurl004.html
http://www.com/myurl005.html
http://www.com/myurl006.html
http://www.com/myurl007.html
http://www.com/myurl008.html
http://www.com/myurl009.html
http://www.com/myurl010.html
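Changing echo to wget as described, the same idea as a one-liner sketch for the question's URLs (range assumed to be 1 to 50):
for i in {1..50}; do wget "http://someaddress.com/logs/dbsclog01s$(printf '%03d' "$i").log"; done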
Interesting task, so I wrote a full script for you (combining several answers and more). Here it is:
#!/bin/bash
# fixed vars
URL=http://domain.com/logs/ # URL address 'till logfile name
PREF=logprefix # logfile prefix (before number)
POSTF=.log # logfile suffix (after number)
DIGITS=3 # how many digits the logfile's number has
DLDIR=~/Downloads # download directory
TOUT=5 # timeout for quit
# code
for((i=1;i<10**$DIGITS;++i))
do
file=$PREF`printf "%0${DIGITS}d" $i`$POSTF # local file name
dl=$URL$file # full URL to download
echo "$dl -> $DLDIR/$file" # monitoring, can be commented
wget -T $TOUT -q "$dl" -O "$DLDIR/$file"
if [ "$?" -ne 0 ] # test if we finished
then
exit
fi
done
At the beginning of the script you can set the URL, the log file prefix and suffix, how many digits the numbering part has, and the download directory. The loop will download all the logfiles it finds, and automatically exit at the first non-existent one (when wget fails or times out).
Note that this script assumes that logfile indexing starts at 1, not zero, as in your example.
Hope this helps.
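To try it, save the script under any name (getlogs.sh below is just a placeholder), adjust the fixed vars at the top, and run:
chmod +x getlogs.sh
./getlogs.sh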
Here you can find a Perl script that looks like what you want
http://osix.net/modules/article/?id=677
#!/usr/bin/perl
$program="wget"; #change this to proz if you have it ;-)
my $count=1; #the lesson number starts from 1
my $base_url= "http://www.und.nodak.edu/org/crypto/crypto/lanaki.crypt.class/lessons/lesson";
my $format=".zip"; #the format of the file to download
my $max=24; #the total number of files to download
my $url;
for($count=1;$count<=$max;$count++) {
if($count<10) {
$url=$base_url."0".$count.$format; #insert a '0' and form the URL
}
else {
$url=$base_url.$count.$format; #no need to insert a zero
}
system("$program $url");
}
I just had a look at the wget manpage discussion of 'globbing':
By default, globbing will be turned on if the URL contains a globbing character. This option may be used to turn globbing on or off permanently.
You may have to quote the URL to protect it from being expanded by your shell. Globbing makes Wget look for a directory listing, which is system-specific. This is why it currently works only with Unix FTP servers (and the ones emulating Unix "ls" output).
So wget http://... won't work with globbing.
Check to see if your system has seq, then it would be easy:
for i in $(seq -f "%03g" 1 10); do wget "http://.../dbsclog${i}.log"; done
If your system has the jot command instead of seq:
for i in $(jot -w "http://.../dbsclog%03d.log" 10); do wget $i; done
Oh! This is a similar problem I ran into when learning bash to automate manga downloads.
Something like this should work:
for a in `seq 1 999`; do
if [ ${#a} -eq 1 ]; then
b="00"
elif [ ${#a} -eq 2 ]; then
b="0"
else
b=""
fi
echo "$a of 999"
wget -q http://site.com/path/fileprefix$b$a.jpg
done
Late to the party, but a really easy solution that requires no coding is to use the DownThemAll Firefox add-on, which can retrieve ranges of files. That was my solution when I needed to download 800 consecutively numbered files.