Trimming filenames in bash with built-in tools only (no rename command) - bash

I have filenames of the form
word-123_AnotherWord--asdf_12345.mp4
word-123_AnotherWord-_asdf-12345.mp4
word-123_AnotherWord-asdf--12345.mp4
word-123_AnotherWord-asdf_-12345.mp4
...which I wish to trim to only contain the last 11 characters and extension.
My current attempt to do so looks like the following:
$ for i in *.mp4 ; do
mv "$i" "${/.*?(.{1,11}\.mp4)$/}";
done
But it gives me this error:
bash: ${/.*?(.{1,11}.mp4)$/}: bad substitution
Any idea why?
This question is a follow-up to this Stack Overflow question, but the answer there only works locally on my PC; it doesn't work on my remote server!
Thanks in advance!

In the syntax ${var/pattern/replacement}, there are several things wrong with the usage "${/.*?(.{1,11}\.mp4)$/}":
First, var isn't optional, it's mandatory.
Second, pattern needs to be given in glob format, not regex format. If you want fancier expressions, use extglob syntax.
Third, unless the intent is to delete the parts matching the expression, the final / should be followed by a replacement string.
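To make those three points concrete, here is a small sketch of ${var/pattern/replacement} with glob patterns. The variable names and the sample value are only illustrative; they are not taken from the asker's server.

```bash
#!/bin/bash
name='word-123_AnotherWord--asdf_12345.mp4'

# 1. A variable name is required before the first slash.
dashes_removed=${name/--/-}      # first "--" becomes "-"

# 2. The pattern is a glob: * means "any string", and the longest match wins.
# 3. The text after the second slash is the replacement (empty means delete).
middle_removed=${name/_*_/_}     # first "_" through last "_" becomes one "_"

# With extglob, +(...) means "one or more of", roughly like (...)+ in a regex.
shopt -s extglob
collapsed=${name//+(-)/-}        # // replaces every match: each run of dashes becomes one

printf '%s\n' "$dashes_removed" "$middle_removed" "$collapsed"
```

The second expansion shows why a glob * can remove more than expected: it runs from the first underscore to the last one.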
If you want to trim everything but the last 15 characters of each name (11 + 4 for the extension), that's trivial:
for i in *.mp4; do
mv "$i" "${i:${#i}-15}"
done
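One thing to watch with the substring approach is a name shorter than 15 characters. A minimal sketch of a guard, assuming a hypothetical helper called tail_chars (not part of bash):

```bash
#!/bin/bash
# Hypothetical helper: print the last N characters of a string,
# or the whole string if it is shorter than N.
tail_chars() {
  local s=$1 n=$2
  if (( ${#s} <= n )); then
    printf '%s\n' "$s"
  else
    printf '%s\n' "${s: -n}"   # the space before - is required; ${s:-n} means something else
  fi
}

tail_chars 'word-123_AnotherWord--asdf_12345.mp4' 15   # prints -asdf_12345.mp4
```

Inside the loop, the target name would then be "$(tail_chars "$i" 15)", at the cost of one subshell per file.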
Now, if you really want to use a regex:
name_re='.{1,11}\.mp4$'
for i in *.mp4; do
[[ $i =~ $name_re ]] && mv -- "$i" "${BASH_REMATCH[0]}"
done

Related

Find a string in a filename with a constant and a regular expression in the middle and replace it

I feel like this is a lame question, but after a lot of attempts, I'm stuck. I have a large number of files like this:
S2EC1_DKDL220005480-2a-AK13554-7UDI265_HHJ2MCCX2_L8_1.fq.gz
S2EC1_DKDL220005480-2a-AK13554-7UDI265_HHJ2MCCX2_L8_2.fq.gz
S2EC2_DKDL220005480-2a-5UDI249-7UDI265_HHJ2MCCX2_L8_1.fq.gz
S2EC2_DKDL220005480-2a-5UDI249-7UDI265_HHJ2MCCX2_L8_2.fq.gz
S2EC11_DKDL220005480-2a-5UDI251-5UDI1063_HHJ2MCCX2_L8_1.fq.gz
S2EC11_DKDL220005480-2a-5UDI251-5UDI1063_HHJ2MCCX2_L8_2.fq.gz
and I'm trying to get them renamed to look like this:
S2EC1_R1.fastq.gz
S2EC1_R2.fastq.gz
S2EC2_R1.fastq.gz
S2EC2_R2.fastq.gz
S2EC11_R1.fastq.gz
S2EC11_R2.fastq.gz
The filenames are variable length. There is a bit that is identical in every filename DKDL220005480-2a- and _HHJ2MCCX2_L8 but it's separated by a bit in the middle that is variable in terms of composition and length.
From my bash shell I can make some progress in a kind of a step-wise fashion by doing this to get rid of the constant text:
for x in *; do mv $x ${x/DKDL220005480-2a-/}; done
for x in *; do mv $x ${x/_HHJ2MCCX2_L8_/_R}; done
Which yields file names like this:
S2EC1_AK13554-7UDI265_R1.fq.gz
S2EC1_AK13554-7UDI265_R2.fq.gz
S2EC2_5UDI249-7UDI265_R1.fq.gz
S2EC2_5UDI249-7UDI265_R2.fq.gz
S2EC11_5UDI251-5UDI1063_R1.fq.gz
S2EC11_5UDI251-5UDI1063_R2.fq.gz
But now I'm failing to find and replace the variable parts in the middle. Of course it would also be much more elegant to do it all in one go.
Here is what I consider my most promising code for matching that variable middle bit:
for x in *; do mv $x ${x/_(.+)_/}; done
But I get this error:
mv: 'S2EC1_AK13554-7UDI265_R1.fq.gz' and 'S2EC1_AK13554-7UDI265_R1.fq.gz' are the same file
mv: 'S2EC1_AK13554-7UDI265_R2.fq.gz' and 'S2EC1_AK13554-7UDI265_R2.fq.gz' are the same file
mv: 'S2EC2_5UDI249-7UDI265_R1.fq.gz' and 'S2EC2_5UDI249-7UDI265_R1.fq.gz' are the same file
mv: 'S2EC2_5UDI249-7UDI265_R2.fq.gz' and 'S2EC2_5UDI249-7UDI265_R2.fq.gz' are the same file
mv: 'S2EC11_5UDI251-5UDI1063_R1.fq.gz' and 'S2EC11_5UDI251-5UDI1063_R1.fq.gz' are the same file
mv: 'S2EC11_5UDI251-5UDI1063_R2.fq.gz' and 'S2EC11_5UDI251-5UDI1063_R2.fq.gz' are the same file
Not sure if it's something wrong with my regular expression or my mv code (or both or even possibly something else, ha ha).
Thanks
Pattern matching and regular expressions are two different things. In pattern matching * means any string. In regular expressions it means zero or more of what precedes. In pattern matching (.+) means... the literal (.+) string. In regular expressions it represents a capture group with at least one character.
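A short sketch of that difference, using made-up sample strings (s and t are illustrative only):

```bash
#!/bin/bash
s='a(.+)b'
t='a_xyz_b'

# == inside [[ ]] uses pattern (glob) matching: * is "any string".
[[ $t == a*b ]] && glob_star=yes || glob_star=no

# =~ uses extended regular expressions: (.+) is a group of one or more characters.
[[ $t =~ ^a(.+)b$ ]] && regex_group=${BASH_REMATCH[1]}

# In ${var/pattern/replacement} the pattern is a glob, so (.+) is just
# those four literal characters.
literal=${s/(.+)/X}      # s really contains "(.+)", so it is replaced
unchanged=${t/(.+)/X}    # t does not, so nothing happens

printf '%s\n' "$glob_star" "$regex_group" "$literal" "$unchanged"
```

The last line is what happened in the question: the pattern never matched, so mv was given the same name twice.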
For your simple renaming scheme you can try:
for f in *.fq.gz; do
g="${f/_DKDL220005480-2a-*_HHJ2MCCX2_L8_/_R}"
printf 'mv "%s" "%s"\n' "$f" "${g%.fq.gz}.fastq.gz"
# mv "$f" "${g%.fq.gz}.fastq.gz"
done
Once satisfied with the output uncomment the mv line.
To use regular expressions in bash you need to use [[ $x =~ regex ]], and you can use groups with $BASH_REMATCH, so:
for x in *; do
[[ $x =~ ^(S2EC[0-9]+)_.*_([0-9]+)\.fq\.gz$ ]] &&
mv -- "$x" "${BASH_REMATCH[1]}_R${BASH_REMATCH[2]}.fastq.gz"
done
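If you want to check the mapping before touching any files, one option is to put the regex in a function and print the mv commands first. This is only a sketch: fastq_name is a hypothetical helper name, and the S2EC prefix is assumed from the sample names in the question.

```bash
#!/bin/bash
# Hypothetical helper: print the target name for $1, or fail if $1 does not match.
fastq_name() {
  local re='^(S2EC[0-9]+)_.*_([0-9]+)\.fq\.gz$'
  [[ $1 =~ $re ]] || return 1
  printf '%s_R%s.fastq.gz\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
}

for f in S2EC*.fq.gz; do
  new=$(fastq_name "$f") || continue   # skip non-matching names (and an unmatched glob)
  printf 'mv -- %q %q\n' "$f" "$new"   # dry run: prints the command instead of running it
done
```

Keeping the regex in a variable avoids quoting surprises inside [[ ]], and the dry run makes it easy to spot two sources mapping to the same target.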

How do I rename multiple files before the extension in linux?

I want to take a group of files with names like 123456_1_2.mpg and turn it into 123456.mpg how can I do this using terminal commands?
To loop over all the available files you can use a for loop over the file names of the form ??????_?_?.mpg.
To rename the files you can remove the longest match of a pattern from the end of the string using ${MYVAR%%pattern}, without using any external command.
This said, your code should look like:
#!/bin/bash
shopt -s nullglob # do nothing if no matches found
for file in ??????_?_?.mpg; do
[[ -f $file ]] || continue # skip if not a regular file
new_file="${file%%_*}.mpg" # compose the new file name
echo mv "$file" "$new_file" # remove echo after testing
done
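For reference, here is how the four trimming operators behave on one of these names. This is a small illustration only; the variable names are made up.

```bash
#!/bin/bash
file='123456_1_2.mpg'

shortest_suffix=${file%_*}    # removes "_2.mpg"     -> 123456_1
longest_suffix=${file%%_*}    # removes "_1_2.mpg"   -> 123456
shortest_prefix=${file#*_}    # removes "123456_"    -> 1_2.mpg
longest_prefix=${file##*_}    # removes "123456_1_"  -> 2.mpg

printf '%s\n' "$shortest_suffix" "$longest_suffix" "$shortest_prefix" "$longest_prefix"
```

% and %% trim from the end, # and ## trim from the start; the doubled form takes the longest match.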
rename 's/_.*/.mpg/' *mpg
This replaces everything from the first underscore to the end of the name with .mpg, for all files ending in mpg. Note that the s/…/…/ syntax needs the Perl-based rename.
We can use grep to strip out everything but the first sequence of numbers. The --interactive flag will ask you if you're sure for each move, so you can make sure it's not doing anything you don't expect.
for file in *.mpg; do
mv --interactive "$file" "$(grep -o '^[0-9]\+' <<< "$file")".mpg
done
The regex ^[0-9]\+ translates to "one or more digits at the very start of the name".
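The same match can be done without grep by using bash's own regex operator. A sketch, assuming a hypothetical helper named leading_digits:

```bash
#!/bin/bash
# Hypothetical helper: print the leading run of digits in $1, or fail if there is none.
leading_digits() {
  [[ $1 =~ ^[0-9]+ ]] || return 1
  printf '%s\n' "${BASH_REMATCH[0]}"
}

for file in *.mpg; do
  stem=$(leading_digits "$file") || continue      # skip names without leading digits
  printf 'mv -i -- %q %q\n' "$file" "$stem.mpg"   # dry run: prints the command only
done
```

Unlike the grep version, a name with no leading digits is skipped here instead of being renamed to a bare .mpg.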

Rename all files with the name pattern *.[a-z0-9].bundle.*, to replace the [a-z0-9] with a given string

On building apps with the Angular 2 CLI, I get outputs which are named, for instance:
inline.d41d8cd.bundle.js
main.6d2e2e89.bundle.js
etc.
What I'm looking to do is create a bash script to rename the files, replacing the digits between the first two . with some given generic string. Tried a few things, including sed, but I couldn't get them to work. Can anyone suggest a bash script to get this working?
In pure bash, using a regex with the =~ operator (supported from bash 3.0 onwards):
#!/bin/bash
string_to_replace_with="sample"
for file in *.js
do
[[ $file =~ \.([[:alnum:]]+).*$ ]] || continue # skip names that do not match
string="${BASH_REMATCH[1]}"
mv -v "$file" "${file/$string/$string_to_replace_with}"
done
For your given input files, running the script
$ bash script.sh
inline.d41d8cd.bundle.js -> inline.sample.bundle.js
main.6d2e2e89.bundle.js -> main.sample.bundle.js
Short, powerful and efficient:
Use the Perl-based rename tool with a Perl regular expression:
rename 's/\.\X{4,8}\./.myString./' *.js
or
rename 's/\.[^.]+\./.myString./' *.js
A pure-bash option:
shopt -s extglob # so *(...) will work
generic_string="foo" # or whatever else you want between the dots
for f in *.bundle.js ; do
mv -vi "$f" "${f/.*([^.])./.${generic_string}.}"
done
The key is the expansion ${f/.*([^.])./.${generic_string}.}. The pattern .*([^.]). matches the first occurrence of .<some text>., where <some text> does not include a dot ([^.]) (see the man page). The replacement .${generic_string}. replaces that with whatever generic string you want. Other than that, double-quote in case you have spaces, and there you are!
Edit Thanks to F. Hauri - added -vi to mv. -v = show what is being renamed; -i = prompt before overwrite (man page).
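To try the expansion without renaming anything, it can be wrapped in a function. This is just a sketch; replace_hash is a hypothetical name.

```bash
#!/bin/bash
shopt -s extglob    # enables *(...), meaning "zero or more of"

# Hypothetical helper: replace the text between the first two dots of $1 with $2.
replace_hash() {
  local name=$1 replacement=$2
  printf '%s\n' "${name/.*([^.])./.${replacement}.}"
}

replace_hash 'inline.d41d8cd.bundle.js' 'foo'   # prints inline.foo.bundle.js
replace_hash 'main.6d2e2e89.bundle.js' 'foo'    # prints main.foo.bundle.js
```

A name with fewer than two dots does not match the pattern and comes back unchanged.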

filename comparison with wildcard

I am working on a script and I need to compare a filename to another one and look for specific changes (in this case a "(x)" added to a filename when OS X needs to add a file to a directory, when a filename already exists) so this is an excerpt of the script, modified to be tested without the rest of it.
#!/bin/bash
p2_s2="/Path/to file (name)/containing - many.special chars.docx.gdoc"
next_line="/Path/to file (name)/containing - many.special chars.docx (1).gdoc"
file_ext=$(echo "${p2_s2}" | rev | cut -d '.' -f 1 | rev)
file_name=$(basename "${p2_s2}" ".${file_ext}")
file_dir=$(dirname "${p2_s2}")
esc_file_name=$(printf '%q' "${file_name}")
esc_file_dir=$(printf '%q' "${file_dir}")
esc_next_line=$(printf '%q' "${next_line}")
if [[ ${esc_next_line} =~ ${esc_file_dir}/${esc_file_name}\ \(?\).${file_ext} ]]
then
echo "It's a duplicate!"
fi
What I'm trying to do here is detect if the file next_line is a duplicate of p2_s2. As I am expecting multiple duplicates, next_line can have a (1) appended at the end of a filename or any other number in brackets (Although I am sure no double digits). As I can't do a simple string compare with a wildcard in the middle, I tried using the "=~" operator and escaping all the special chars. Any idea what I'm doing wrong?
You can trim p2_s2's extension, trim next_line's extension including the number inside the parentheses, and see if you get the same file name. If you do, it's a duplicate. In order to do so, [[ allows us to perform a comparison between a string and a glob.
I used extglob's +( ... ) pattern, so I can use +([0-9]) to match the number inside the parenthesis. Notice that extglob is enabled by shopt -s extglob.
#!/bin/bash
p2_s2="/Path/to/ps2.docx.gdoc"
next_line="/Path/to/ps2(1).docx.gdoc"
shopt -s extglob
if [[ "${p2_s2%%.*}" = "${next_line%%\(+([0-9])\).*}" ]]; then
printf '%s is a duplicate of %s\n' "$next_line" "$p2_s2"
fi
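For names shaped like the ones in the question, where " (N)" sits right before the last extension, the same idea can be written as a single glob comparison. This is a sketch under that assumption; is_numbered_copy is a hypothetical helper name.

```bash
#!/bin/bash
shopt -s extglob

# Hypothetical helper: succeed if $2 equals $1 with " (N)" inserted
# before the last extension (an assumption about how the copies are named).
is_numbered_copy() {
  local original=$1 candidate=$2
  local stem=${original%.*} ext=${original##*.}
  # Quoted parts are matched literally; only +([0-9]) is treated as a pattern.
  [[ $candidate == "$stem ("+([0-9])").$ext" ]]
}

p2_s2="/Path/to file (name)/containing - many.special chars.docx.gdoc"
next_line="/Path/to file (name)/containing - many.special chars.docx (1).gdoc"
if is_numbered_copy "$p2_s2" "$next_line"; then
  echo "It's a duplicate!"
fi
```

Because the variable parts are quoted, there is no need for printf '%q': spaces, dots and parentheses in the path are compared as plain characters.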
EDIT:
I now see that you've edited your question, so in case this solution is not enough, I'm positive that it'll be a good template to work with.
The (1) in next_line doesn't come before the final .; it comes before the second-to-final . in the original filename, but you only strip off a single . as the extension.
So when you generate the comparison filename you end up with /Path/to\ file\ \(name\)/containing\ -\ many.special\ chars.docx\ \(?\).gdoc which doesn't match what you expect.
If you had added set -x to the top of your script you'd have seen what the shell was actually doing and seen this.
What does OS X actually do in this situation? Does it add (#) before .gdoc? Does it add it before .docx? Does it depend on whether OS X knows what the file type is (i.e., whether it is some type it can open natively)?

Extracting a string between last two slashes in Bash

I know this can be easily done using regex like I answered on https://stackoverflow.com/a/33379831/3962126, however I need to do this in bash.
The closest question I found on Stack Overflow is this one, "bash: extracting last two dirs for a pathname"; the difference, however, is that if
DIRNAME = /a/b/c/d/e
then I need to extract
d
This may be relatively long, but it's also much faster to execute than most preceding answers (other than the zsh-only one and that by j.a.), since it uses only string manipulations built into bash and uses no subshell expansions:
string='/a/b/c/d/e' # initial data
dir=${string%/*} # trim everything past the last /
dir=${dir##*/} # ...then remove everything before the last / remaining
printf '%s\n' "$dir" # demonstrate output
printf is used in the above because echo doesn't work reliably for all values (think about what it would do on a GNU system with /a/b/c/-n/e).
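The two expansions can also be wrapped in a function with a check for paths that have fewer than two slashes. A minimal sketch; parent_name is a hypothetical name.

```bash
#!/bin/bash
# Hypothetical helper: print the path component between the last two slashes of $1.
parent_name() {
  local path=$1 dir
  [[ $path == */*/* ]] || return 1   # fail unless there are at least two slashes
  dir=${path%/*}                     # drop the last component
  dir=${dir##*/}                     # keep what follows the last remaining slash
  printf '%s\n' "$dir"
}

parent_name '/a/b/c/d/e'      # prints d
parent_name '/a/b/c/-n/e'     # prints -n
```

Like the original snippet, this runs entirely in the current shell when called directly; capturing the result with $(...) does add a subshell.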
Here is a pure bash solution:
[[ $DIRNAME =~ /([^/]+)/[^/]*$ ]] && printf '%s\n' "${BASH_REMATCH[1]}"
Compared to some of the other answers:
It matches the string between the last two slashes. So, for example, it doesn't match d if DIRNAME=d/e.
It's shorter and fast (just uses built-ins and doesn't create subprocesses).
It supports any character between the last two slashes (see Charles Duffy's answer for more on this).
Also notice that this is not the way to assign a variable in bash:
DIRNAME = /a/b/c/d/e
       ^ ^
Those spaces are wrong, so remove them:
DIRNAME=/a/b/c/d/e
Using awk:
echo "/a/b/c/d/e" | awk -F / '{ print $(NF-1) }' # d
Edit: This does not work when the path contains newlines, and it still gives output when there are fewer than two slashes; see the comments below.
Using sed
If you want to get the fourth element:
DIRNAME="/a/b/c/d/e"
echo "$DIRNAME" | sed -r 's_^(/[^/]*){3}/([^/]*)/.*$_\2_g'
If you want to get the second-to-last element:
DIRNAME="/a/b/c/d/e"
echo "$DIRNAME" | sed -r 's_^.*/([^/]*)/[^/]*$_\1_g'
OMG, maybe this was obvious, but not to me initially. I got the right result with:
dir=$(basename -- "$(dirname -- "$str")")
echo "$dir"
Using zsh parameter substitution is pretty cool too
echo ${${DIRNAME%/*}##*/}
I think it's faster than the double $() as well, because it won't need any subprocesses.
Basically it slices off the right side first, and then all the remaining left side second.
