bash script to rename files based on a calculation - bash

I have a file system containing PNG images. The layout of the filesystem is: ZOOM/X/Y.png where ZOOM, X, and Y are all integers.
I need to change the names of the PNG files. Basically, I need to convert Y from its current value to 2^ZOOM-Y-1. I've written a bash script to accomplish this task. However, I suspect it can be optimized substantially. (I also suspect that I may have been better off writing it in Perl, but that is another story.)
Here is the script. Is this about as good as it gets? Or can the performance be optimized? Are there tools I can use that would profile the script for me and tell me where I'm spending all my execution time?
#!/bin/bash
tiles=`ls -d */*/*`
for oldPath in $tiles
do
oldY=`basename -s .png $oldPath`
zoomX=`dirname $oldPath`
zoom=`echo $zoomX | sed 's#\([^\]\)/.*#\1#'`
newY=`echo 2^$zoom-$oldY-1|bc`
mv ${zoomX}/${oldY}.png ${zoomX}/${newY}.png
done

for oldpath in */*/*
do
x=$(basename "$oldpath" .png)
zoom_y=$(dirname "$oldpath")
y=$(basename "$zoom_y")
ozoom=$(dirname "$zoom_y")
nzoom=$(echo "2^$ozoom-$y-1" | bc)
mv "$oldpath" "$nzoom/$y/$x.png"
done
This avoids using sed. I like basename and dirname. However, you can also use bash (and Korn) shell notations such as:
y=${zoom_y#*/}
ozoom=${zoom_y%/*}
You might be able to do it all without invoking basename or dirname at all.

REWRITE due to a misunderstanding of the formula, and with updated variable names. Still no subprocesses apart from mv and ls.
#!/bin/bash
tiles=`ls -d */*/*`
for thisPath in $tiles
do
thisFile=${thisPath#*/*/}
oldY=${thisFile%.png}
zoomX=${thisPath%/*}
zoom=${thisPath%/*/*}
newY=$(((1<<zoom) - oldY - 1))
mv ${zoomX}/${oldY}.png ${zoomX}/${newY}.png
done

It's likely that the overall throughput of your rename is limited by the filesystem. Choosing the right filesystem and tuning it for this sort of operation would speed up the overall job much more than tweaking the script.
If you optimize the script you'll probably see less CPU consumed but the same total duration. Since forking off the various subprocesses (basename, dirname, sed, bc) probably costs more than the actual work, you are probably right that a Perl implementation would use less CPU, because it can do all of those operations internally (including the mv).

I see 3 improvements I would make if it were my script. I doubt they have a huge impact, though.
First, you should avoid parsing the output of ls like the plague. Maybe this directory is very predictable from what it contains, but if I read your script correctly, you can use globbing with for directly:
for thisPath in */*/*
Second, $(cmd) is preferable to the deprecated backticks, which can't be nested:
thisDir=$(dirname $thisPath)
Third, do the arithmetic in bash directly:
newTile=$((2**$zoom-$thisTile-1))
as long as you don't need floating point and the numbers don't get too big.
I don't get the sed part:
zoom=`echo $zoomX | sed 's#\([^\]\)/.*#\1#'`
Is there something missing after the backslash? A second one? As written, you're searching for something that isn't a backslash, followed by a slash and anything after it? Maybe it could be done purely in bash too.
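For what it's worth, a minimal sketch of the same extraction using only parameter expansion (assuming paths of the form ZOOM/X/Y.png, with the question's variable names):
oldY=${oldPath##*/}     # strip everything up to the last slash: Y.png
oldY=${oldY%.png}       # strip the extension: Y
zoomX=${oldPath%/*}     # strip the last path component: ZOOM/X
zoom=${oldPath%%/*}     # strip everything from the first slash on: ZOOM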

One precept of computing credited to Donald Knuth is, "don't optimize too early." Scripts run pretty fast, and mv operations (as long as they're not going across filesystems, where you're really copying to another disk and then deleting the file) are pretty fast as well, since all the filesystem has to do in most cases is rename the file or change its parentage.
Probably where it's spending most of its time is in that initial ls operation. I suspect you have A LOT of files. There isn't much that can be done there; doing it in another language like Perl or Python is going to face the same hurdle. However, you might be able to get more INTELLIGENCE and not limit yourself to 3 levels (*/*/*).
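For example, a sketch of what that extra intelligence might look like; find walks the tree at any depth instead of relying on a fixed three-level glob (this layout is an assumption, and filenames containing newlines would still need -print0 handling):
# enumerate every .png at any depth, one per line
find . -type f -name '*.png' |
while IFS= read -r tile; do
    echo "$tile"        # the rename logic would go here
done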

Related

How to label files with incrementing numbers in bash/Cygwin?

At work, I frequently generate series of .png images produced by an oscilloscope while testing circuits. I like to label the images with descriptive titles afterward to keep track of which image was for which measurement. Normally I use Cygwin to rename them in batches and then go back manually to add numbers to their names, but this is very tedious if there are a lot of samples from different tests. I am trying to write a bash script that will work to label them quickly and easily.
For example, if I have the files
scope1.png,
scope2.png,
scope3.png,
scope4.png
how would I write a bash script that could label them as
circuit_1_sample_1.png,
circuit_1_sample_2.png,
circuit_2_sample_1.png,
circuit_2_sample_2.png
I could probably do this quite easily in python, but is there an easy way to make bash or Cygwin do this?
Thanks.
If all scopes are for the same circuit:
for f in scope*.png; do mv "$f" "circuit_1_sample_${f/scope/}"; done
if only scopes 9-12 are for the same circuit:
for f in scope{9..12}.png; do ...
If it's literally exactly like your question where there are two scopes per circuit:
for f in scope*.png; do
num=${f//[^0-9]/}
mv "$f" "circuit_$(( num / 2 ))_sample_$(( num % 2)).png"
done
Globbing with * works like you'd expect: it finds all files in the current directory that match the pattern, with * as a multi-character wildcard. $(( )) is just bash arithmetic, and works like you'd expect (integer only). The ${f/...} stuff is bash parameter expansion. It's pretty cool; you should read about it.
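For instance, a quick illustration of the two expansions used above (the filename is just an example):
f=scope12.png
echo "${f//[^0-9]/}"    # prints 12: doubled slashes delete every non-digit
echo "${f/scope/}"      # prints 12.png: a single slash removes the first match only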

Bash: Trying to append to a variable name in the output of a function

This is my very first post on Stack Overflow, and I should probably point out that I am EXTREMELY new to a lot of programming. I'm currently a postgraduate student doing projects involving a lot of coding in various programs, everything from LaTeX to bash, MATLAB, etc.
If you could explicitly explain your answers, that would be much appreciated, as I'm trying to learn as I go. I apologise if there is an answer elsewhere that does what I'm trying to do, but I have spent a couple of days looking now.
So to the problem I'm trying to solve: I'm currently using a selection of bioinformatics tools to analyse a range of genomes, and I'm trying to somewhat automate the process.
I have a few sequences with names that look like this for instance (all contained in folders of their own currently as paired files):
SOL2511_S5_L001_R1_001.fastq
SOL2511_S5_L001_R2_001.fastq
SOL2510_S4_L001_R1_001.fastq
SOL2510_S4_L001_R2_001.fastq
...and so on...
I basically wish to automate the process by turning these into variables and passing the variables to each of the programs I use in turn. So, for example, my idea thus far was to assign them as wildcards, using the R1 and R2 (which appear in all the file names, as they represent each strand of DNA) as follows:
#!/bin/bash
seq1=*R1_001*
seq2=*R2_001*
On a rudimentary level this works, as it returns the correct files, so now I pass these variables to my first function which trims the DNA sequences down by a specified amount, like so:
# seqtk is the program suite, trimfq is a function within it,
# and the options -b -e specify how many bases to trim from the beginning and end of
# the DNA sequence respectively.
seqtk trimfq -b 10 -e 20 $seq1 >
seqtk trimfq -b 10 -e 20 $seq2 >
So now my problem is that I wish to append something like "_trim" to the output file which appears after the >, but I can't find anything online that seems like it will work.
Alternatively, I've been hunting for a script that will take the name of the folder that the files are in, and create a variable for the folder name which I can then give to the functions in question so that all the output files are named correctly for use later on.
Many thanks in advance for any help, and I apologise that this isn't really much of a minimum working example to go on, as I'm only just getting going on all this stuff!
Joe
EDIT
So I modified @ghoti's for loop (it does the job wonderfully, I might add; rep for you :D), and now I prepend trim_, as the loop as it was before ended up giving me a .fastq.trim extension, which will cause errors later.
Is there any way I can append _trim to the end of the filename, but before the extension?
Explicit is usually better than implicit when matching filenames. Your wildcards may match more than you expect, especially if you have versions of the files with "_trim" appended to the end!
I would be more precise with the wildcards, and use for loops to process the files instead of relying on seqtk to handle multiple files. That way, you can do your own processing on the filenames.
Here's an example:
#!/bin/bash
# Define an array of sequences
sequences=(R1_001 R2_001)
# Step through the array...
for seq in "${sequences[@]}"; do
# Step through the files in this sequence...
for file in SOL*_${seq}.fastq; do
seqtk trimfq -b 10 -e 20 "$file" > "${file}.trim"
done
done
I don't know how your folders are set up, so I haven't addressed that in this script. But the basic idea is that if you want the script to be able to manipulate individual filenames, you need something like a for loop to handle that manipulation on a per-filename basis.
Does this help?
UPDATE:
To put _trim before the extension, replace the seqtk line with the following:
seqtk trimfq -b 10 -e 20 "$file" > "${file%.fastq}_trim.fastq"
This uses something documented in the Bash man page under Parameter Expansion if you want to read up on it. Basically, the ${file%.fastq} takes the $file variable and strips off a suffix. Then we add your extra text, along with the suffix.
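A quick illustration with one of the filenames from the question:
file=SOL2511_S5_L001_R1_001.fastq
echo "${file%.fastq}_trim.fastq"    # prints SOL2511_S5_L001_R1_001_trim.fastq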
You could also strip an extension using basename(1), but there's no need to call something external when you can use something built in to the shell.
Instead of setting variables with the filenames, you could pipe the output of ls to the command you want to run with these filenames, like this:
ls *R{1,2}_001* | xargs -I# sh -c 'seqtk trimfq -b 10 -e 20 "$1" > "${1}_trim"' -- #
xargs -I# takes each line of output from the previous command, substitutes it for # in the command that follows, and runs that command once per file.
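As other answers here note, parsing the output of ls is fragile; a hedged sketch of the same idea using find instead (GNU find and xargs assumed), which also puts _trim before the extension:
find . -maxdepth 1 -name '*R[12]_001*.fastq' -print0 |
xargs -0 -I# sh -c 'seqtk trimfq -b 10 -e 20 "$1" > "${1%.fastq}_trim.fastq"' -- #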

What platform independent way to find directory of shell executable in shell script?

According to POSIX:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sh.html
there are some cases where it is not obvious. For example:
If the file is not in the current working directory,
the implementation may perform a search for an executable
file using the value of PATH, as described in Command Search and Execution.
My Bash 4.x doesn't follow this optional rule (due to security concerns??), so I can't test how it behaves in real life...
What platform independent way to find directory of shell executable in shell script?
PS. Also the dirname $0 approach fails with:
#!/bin/sh
echo $0
dirname $0
when you:
$ sh runme.sh
runme.sh
.
So you need something like:
CMDPATH=`cd $(dirname $0); echo $PWD`
To make the code depend only on built-in shell capabilities, I rewrote it as:
PREVPWD=$PWD
cd ${0%${0##*/}}.
CMDPATH=$PWD
cd $PREVPWD
This looks ugly but doesn't require forking any executables...
EDIT3:
Though not strictly POSIX yet, realpath has been a GNU coreutils program since 2012. Full disclosure: I had never heard of it before I noticed it in the info coreutils TOC and immediately thought of this question, but using the following function as demonstrated should reliably, (soon POSIXLY?), and, I hope, efficiently provide its caller with an absolutely sourced $0:
% _abs_0() {
> o1="${1%%/*}"; ${o1:="${1}"}; ${o1:=`realpath -s "${1}"`}; eval "$1=\${o1}";
> }
% _abs_0 ${abs0:="${0}"} ; printf %s\\n "${abs0}"
/no/more/dots/in/your/path2.sh
EDIT4: It may be worth highlighting that this solution uses POSIX parameter expansion to first check whether the path actually needs expanding and resolving at all before attempting to do so. This should return an absolutely sourced $0 via a messenger variable (with the notable exception that -s will preserve symlinks) as efficiently as I can imagine it being done, whether or not the path is already absolute.
EDIT2:
Now I believe I understand your problem much better which, unfortunately, actually renders most of the below irrelevant.
(minor edit: before finding realpath in the docs, I had at least pared down my version of this not to depend on the time field, but, fair warning, after testing some I'm less convinced ps is fully reliable in its command path expansion capacity)
On the other hand, you could do this:
ps ww -fp $$ | grep -Eo '/[^:]*'"${0#*/}"
eval "abs0=${`ps ww -fp $$ | grep -Eo ' /'`#?}"
I need to fix it to work better with fields instead of expecting the time field to come just before the process's path and relying on its included colon as a reference, especially because this will not work with a colon in your process's path, but that's trivial and will happen soon, I think. The functionality is otherwise POSIX compliant, I believe. Probably parameter expansion alone can do what is necessary, I think.
Not strictly relevant (or correct):
This should work in every case that conforms to POSIX guidelines:
echo ${0%/*}
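One caveat, echoing the PS in the question: when $0 contains no slash, this expansion returns the string unchanged instead of ".", for example:
cmd=runme.sh        # what $0 looks like after: sh runme.sh
echo "${cmd%/*}"    # prints runme.sh, not "."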
EDIT:
So I'll confess that, at least at first blush, I don't fully understand the issue you describe. Obviously in your question you demonstrate some familiarity with POSIX standards for variable string manipulation via parameter expansion (even if your particular implementation seems slightly strained at a glance), so it's likely I'm missing some vital piece of information in my interpretation of your question and perhaps, at least in its current form, this is not the answer you seek.
I have posted before on parameter expansion for inline variable null/set tests which may or may not be of use to you as you can see at the "Portable Way to Check Emptiness of a Shell Variable" question. I mention this mainly because my answer there was in large part copied/pasted from the POSIX guidelines on parameter expansion, includes an anchored link to the guidelines coverage on this subject, and a few examples from both the canonical documentation and my own perhaps less expertly demonstrated constructs.
I will freely admit, however, that while I do not yet fully understand what it is you ask, I don't believe that you will find a specific answer there. Instead I suspect you may have forgotten, as I do occasionally, that the # and % operators in POSIX string manipulation specify the part of the string you want to remove, not the part you wish to retain, as some might find more intuitive. What I mean is that any string slice you search for in this way is designed to disappear from your output, which will then be only what remains of your original string after your specified search string is removed.
So here's a bit of an overview:
A single instance of either operator removes only as little as possible to satisfy your search; when doubly instanced, the search becomes greedy and removes as much of the original string as your pattern could possibly allow.
Other than that you need only know some basic glob patterns, and remember that # begins its search for your removal string from the left and scans through to the right, while % begins its search from the right and scans through to the left.
## short example before better learning if I'm on the right track
## demonstrating path manipulation with '#' and '%'
% _path_one='/one/two/three/four.five'
% _path_two='./four.five'
## short (non-greedy) searching from the right, with our wildcard to the
## right side of a single character, removes everything to the right
## of the specified character and the character itself;
## this is a very simple means of stripping extensions and paths
% echo ${_path_one%.*} ${_path_one%/*}
/one/two/three/four /one/two/three
## long (greedy) searching from the left, with the wildcard to the left,
## of course produces the opposite results
% echo ${_path_one##*.} ${_path_one##*/}
five four.five
## will soon come back to show more probably
I believe you can get it using readlink (note that -f is a GNU extension and may be absent on some BSD/macOS systems):
scriptPath=$(readlink -f -- "$0")
scriptDirectory=${scriptPath%/*}

how much should I worry about argument list too long?

I have a shell script which uses * to do wildcard matching. For example:
mv /someplace/*.DAT /someotherplace
And
for file in /someplace/*.DAT
do
echo $file
done
Then when I think about error handling, I worry about the infamous "argument list too long" error.
How much should I worry about it? How long an argument list can the shell actually hold? For example, will it die at 500 files, or 1000 files? Does it depend on the length of the filenames?
EDIT:
I have found out that the argument maximum is 131072 bytes. I am not looking for a solution to overcome the argument-list-too-long problem. What I really want to know is: how does that limit translate to a normal command string? i.e., how "long" can the command be? Do spaces count?
pardon my ignorance
If I remember correctly, it is capped at 32 KB of data
first command
find /someplace -name '*.DAT' -print0 | xargs -r0 mv --target-directory='/someotherplace'
second command
find /someplace -type f -name "*.DAT"
Yes, it depends on filename length. The command line maximum is a single hardcoded limit, so long filenames will exhaust it faster. And it's usually a kernel limitation, so there is no way around it within bash. And yes, this is serious: errors that occur only infrequently are always more serious than obvious errors, because quality assurance will probably miss them, and when they do happen it is almost guaranteed to be with a nightmarish unreadable command line that you can't even reconstruct properly!
For all these reasons: deal with the problem now rather than later.
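If you want to see the actual limit on a given system, getconf can report it (on Linux the limit covers the combined size of the arguments and the environment):
getconf ARG_MAX                    # e.g. 2097152 on many Linux systems
xargs --show-limits < /dev/null    # GNU xargs will also report how it sizes command lines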
Whether
How much should you worry about it? You may as well ask "What is the lifespan of my code?"
I would urge you to always worry about the argument list limit. This limit is set at compile time and can easily differ between systems, shells, etc. Do you know for sure that your code will always run in its original environment, with expected input, and with that environment's original limit?
If the expansion of a glob could produce an unknown number of files, or files with names of unknown length, or if that expansion could exceed the limit in effect in any unknown future environment, then you should write your code from day one to avoid this bug.
How
There are three find-based solutions to this problem. The classic solution uses xargs:
find ... | xargs command
xargs will execute command with as many matches as it can without overflowing the argument list, then repeat that invocation as necessary until there are no more results from find.
This solution is problematic because file names may contain newlines. If you're lucky you have a nicer version of find which supports null-terminating file names with -print0 and you can use the safer solution
find ... -print0 | xargs -0 command
This is the same as the first find except it's safe for all legal file names.
Newer versions of find may support -exec with the + terminator, which allows for another solution
find ... -exec command {} +
This is functionally identical to the second find command above: safe for all file names, splits invocations of command into chunks that won't overflow the argument list. I prefer this form, when available.
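Applied to the mv example from the question, the third form might look like this (a sketch: -maxdepth and mv -t are GNU extensions):
# batch as many .DAT files per mv invocation as will fit
find /someplace -maxdepth 1 -name '*.DAT' -exec mv -t /someotherplace {} +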

BASH Expression to replace beginning and ending of a string in one operation?

Here's a simple problem that's been bugging me for some time. I often find I have a number of input files in some directory, and I want to construct output file names by replacing beginning and ending portions. For example, given this:
source/foo.c
source/bar.c
source/foo_bar.c
I often end up writing BASH expressions like:
for f in source/*.c; do
a="obj/${f##*/}"
b="${a%.*}.obj"
process "$f" "$b"
done
to generate the commands
process "source/foo.c" "obj/foo.obj"
process "source/bar.c "obj/bar.obj"
process "source/foo_bar.c "obj/foo_bar.obj"
The above works, but it's a lot wordier than I like, and I would prefer to avoid the temporary variables. Ideally there would be some command that could replace the beginning and end of a string in one shot, so that I could just write something like:
for f in source/*.c; do process "$f" "obj/${f##*/%.*}.obj"; done
Of course, the above doesn't work. Does anyone know something that will? I'm just trying to save myself some typing here.
Not the prettiest thing in the world, but you can use a regular expression to group the content you want to pick out, and then refer to the BASH_REMATCH array:
if [[ $f =~ ^source/(.*)\.c$ ]] ; then f="obj/${BASH_REMATCH[1]}.obj"; fi
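In the loop from the question, that might look like this (a sketch reusing the question's process command):
for f in source/*.c; do
    if [[ $f =~ ^source/(.*)\.c$ ]]; then
        process "$f" "obj/${BASH_REMATCH[1]}.obj"
    fi
done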
You shouldn't have to worry about your code being "wordier" or not. In fact, being a bit verbose does no harm; consider how much it will improve your (or someone else's) understanding of the script. Besides, for performance, using bash's internal string manipulation is much faster than calling external commands. Lastly, you are not going to retype these commands every time you use them, right? So why worry that it's "wordier" when the commands are already in your script?
Not directly in bash. You can use sed, of course:
b="$(sed 's|^source/(.*).c$|obj/$1.obj|' <<< "$f")"
Why not simply use cd to remove the "source/" part?
This way we can avoid the temporary variables a and b:
for f in $(cd source; printf "%s\n" *.c); do
echo process "source/${f}" "obj/${f%.*}.obj"
done
