Rename multiple datetime files in Unix by inserting - and _ characters - bash

I have many files in a directory that I want to rename so that they are recognizable according to a certain convention:
SURFACE_OBS:2019062200
SURFACE_OBS:2019062206
SURFACE_OBS:2019062212
SURFACE_OBS:2019062218
SURFACE_OBS:2019062300
etc.
How can I rename them in UNIX to be as follows?
SURFACE_OBS:2019-06-22_00
SURFACE_OBS:2019-06-22_06
SURFACE_OBS:2019-06-22_12
SURFACE_OBS:2019-06-22_18
SURFACE_OBS:2019-06-23_00

A bash shell loop using mv and parameter expansion could do it:
for file in *:[[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]]
do
    prefix=${file%:*}
    suffix=${file#*:}
    mv -- "${file}" "${prefix}:${suffix:0:4}-${suffix:4:2}-${suffix:6:2}_${suffix:8:2}"
done
This loop picks up every file that matches the pattern:
* -- anything
: -- a colon
[[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]] -- 10 digits
... and then renames it by inserting dashes and an underscore in the desired locations.
I've chosen the wildcard for the loop carefully so that it matches only the "input" files and not the renamed files. Adjust the pattern as needed if your actual filenames have edge cases that make the wildcard match already-renamed files (and thus rename them a second time).
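To check the substring offsets before renaming anything, the same expansion can be wrapped in a function that only prints the new name. This is a minimal sketch, assuming bash; the function name format_obs_name is hypothetical and not part of the answer above.

```shell
#!/bin/bash
# format_obs_name is a hypothetical helper: it prints the new name for
# an input like PREFIX:YYYYMMDDHH and does not touch any files.
format_obs_name() {
    local file=$1
    local prefix=${file%:*}   # everything before the last colon
    local suffix=${file#*:}   # everything after the first colon
    printf '%s\n' "${prefix}:${suffix:0:4}-${suffix:4:2}-${suffix:6:2}_${suffix:8:2}"
}

format_obs_name "SURFACE_OBS:2019062218"
```

Once the printed names look right, the printf can be swapped for the mv call from the loop.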

#!/bin/bash
strindex() {
    # print the position of the first occurrence of $2 in $1, or -1 if absent
    local x="${1%%"$2"*}"
    [[ "$x" = "$1" ]] && echo -1 || echo "${#x}"
}
get_new_filename() {
    # change filenames like: SURFACE_OBS:2019062218
    # into filenames like: SURFACE_OBS:2019-06-22_18
    local src_str="${1}"
    # add last underscore 2 characters from end of string
    local final_underscore_pos=$(( ${#src_str} - 2 ))
    src_str="${src_str:0:final_underscore_pos}_${src_str:final_underscore_pos}"
    # get position of colon in string
    local colon_pos
    colon_pos=$(strindex "${src_str}" ":")
    # get dash locations relative to colon position
    local y_dash_pos=$(( colon_pos + 5 ))
    local m_dash_pos=$(( colon_pos + 8 ))
    # now add dashes in date
    src_str="${src_str:0:y_dash_pos}-${src_str:y_dash_pos}"
    src_str="${src_str:0:m_dash_pos}-${src_str:m_dash_pos}"
    echo "${src_str}"
}
# accept path as argument or default to /tmp/baz/data
target_dir="${1:-/tmp/baz/data}"
while read -r line ; do
    # since file renaming depends on position of colon extract
    # base filename without path in case path has colons
    base_dir=${line%/*}
    filename_to_change=$(basename "${line}")
    # dry run: this prints the mv command instead of running it
    echo "mv ${line} ${base_dir}/$(get_new_filename "${filename_to_change}")"
# find excludes files that have already been renamed;
# note that -name takes a shell glob, not a regex
done < <(find "${target_dir}" -name 'SURFACE*' -a ! -name '*_[0-9][0-9]')

Related

Get only file name in variable in a for loop

I have a for loop that writes text in a file :
for f in $DATA_DIRECTORY
do
echo ' input.'$f '{'
echo ' copy = ${source.copy}"'$f'.CPY"'
echo ' data = ${source.data}"'$f'.CSV"'
echo ' }'
done
But the "f" variable here looks like this:
/path/to/my/file/FILE.TXT
What I want is only the name of the file, without the path or the extension:
FILE
By the way, I tried to change my f variable like this to drop the extension, but it did not work:
{$f%%.*}
You need two steps; parameter expansion operators can't be chained in a single expansion.
f=${f##*/} # Strip the directory
f=${f%%.*} # Strip the extensions
Or, you can use the basename command to strip the directory and one extension (assuming you know what it is) in one line. The suffix match is case-sensitive, so for FILE.TXT it has to be .TXT:
f=$(basename "$f" .TXT)
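The two expansions can also be combined in a small function so the loop body stays readable. A minimal sketch, assuming bash; the name stem is hypothetical.

```shell
#!/bin/bash
# stem is a hypothetical helper: prints a path without its directory
# and without anything after the first dot in the file name.
stem() {
    local name=${1##*/}   # strip the directory
    name=${name%%.*}      # strip the extension(s)
    printf '%s\n' "$name"
}

stem "/path/to/my/file/FILE.TXT"
```

Stripping the directory first matters: a dot in a directory name would otherwise be treated as the start of the extension.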

Basic string manipulation from filenames in bash

I have some file names in bash that I have acquired with
$ ones=$(find SRR*pass*1*.fq)
$ echo $ones
SRR6301033_pass_1_trimmed.fq
SRR6301034_pass_1_trimmed.fq
SRR6301037_pass_1_trimmed.fq
...
I then converted into an array so I can iterate over this list and perform some operations with filenames:
# convert to array
$ ones=(${ones// / })
and the iteration:
for i in $ones;
do
fle=$(basename $i)
out=$(echo $fle | grep -Po '(SRR\d*)')
echo "quants/$out.quant"
done
which produces:
quants/SRR6301033
SRR6301034
...
...
SRR6301220
SRR6301221.quant
However I want this:
quants/SRR6301033.quant
quants/SRR6301034.quant
...
...
quants/SRR6301220.quant
quants/SRR6301221.quant
Could somebody explain why what I'm doing doesn't work and how to correct it?
There is no need to make this so complicated. You can drop the roundabout steps and use just a for loop and built-in parameter expansion to get this done.
# Initialize an empty indexed array
array=()
# Loop over files ending with '.fq'. If there are no such files, the glob
# stays unexpanded as the literal '*.fq', the '-f' test fails, and that
# single iteration is skipped
for file in *.fq; do
    [ -f "$file" ] || continue
    # Get the part of filename after the last '/' ( same as basename )
    bName="${file##*/}"
    # Remove the part after the first '.' (removing extension)
    woExt="${bName%%.*}"
    # In the resulting string, remove the part after first '_'
    onlyFir="${woExt%%_*}"
    # Append the result to the array, with prefix 'quants/' and suffix '.quant'
    array+=( quants/"$onlyFir".quant )
done
Now print the array to see the result
for entry in "${array[@]}"; do
    printf '%s\n' "$entry"
done
Ways your attempt could fail
With ones=$(find SRR*pass*1*.fq) you are storing the results in a variable and not in an array. A plain variable cannot distinguish a list of names from a single string that happens to contain whitespace.
With echo $ones, i.e. an unquoted expansion, the string content is subject to word splitting. You will not see a difference as long as none of the filenames contain spaces; a filename with a space would be split and its parts interpreted as different files.
The part ${ones// / } does nothing useful for converting the string to an array: it replaces each space with a space, and the unquoted expansion is still subject to word splitting.
for i in $ones; is error-prone for the same reasons: a filename with spaces would be interpreted as separate files instead of one. In addition, once ones is an array, $ones expands to its first element only.
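The difference between a quoted array and an unquoted string can be seen without any real files. A small sketch, assuming bash; the two file names are made up for illustration.

```shell
#!/bin/bash
# Two made-up names, one of which contains a space.
names=( "SRR1_pass_1.fq" "SRR2 copy_pass_1.fq" )

# Quoted "${names[@]}" yields one word per array element.
count_quoted=0
for n in "${names[@]}"; do
    count_quoted=$(( count_quoted + 1 ))
done

# Flattening to a string and expanding it unquoted splits on whitespace.
flat="${names[*]}"
count_unquoted=0
for n in $flat; do
    count_unquoted=$(( count_unquoted + 1 ))
done

echo "quoted array: $count_quoted items, unquoted string: $count_unquoted items"
```

The array loop sees two names, while the unquoted string is split into three words because of the space.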

Can I find similar named files ignoring case, dashes, spaces or other characters?

EDIT 2:
Let's say I have 2 directories. One contains:
/dir1/Test File Name.txt
/dir1/This is anotherfile.txt
/dir1/And-Another File.txt
Directory 2 looks like:
/dir2/test-File_Name.txt
/dir2/test file_Name.txt
/dir2/This Is another file.txt
/dir2/And another_file.txt
How can I find (or match) files that are named similarly? In this example, file 1 from dir1 would match files 1 and 2 in dir2, and so on.
I am trying to do this in bash. Say I have a file named "Test File 1.txt"; I want to find any file with a similar name, like:
test-file 1.txt
test file 1.txt
Test-file-1.txt
test-file_1.zip
etc etc
I can ignore case with find ./files/ -maxdepth 1 -iname $FILE but don't know how to ignore all the other characters.
Is there a way I can do this in bash?
EDIT:
Sorry, I forgot to mention that I need to iterate over all files; the file name is not always the same, I just used an example.
So it could be named "Test File 1.txt" or something completely different, like "Something Else.txt".
So I want to look for all similarly named files using a complete file name as the base, but this file name can vary. I hope that makes more sense.
If Perl is an option, please try the following:
perl -e '
    @files1 = glob "dir1/*";
    @files2 = glob "dir2/*";
    foreach (@files2) {
        $f2 = $_;
        s#.*/##;                # remove directory name
        # s#\..*?$##;           # remove extension (wrong)
        s#\.[^.]*$##;           # remove extension (corrected)
        s#[\W_]#[\\W_]?#g;      # replace non-alphanumeric chars
        $pat = $_ . "\\.\\w+\$";
        # print $pat, "\n";     # uncomment to see the regex pattern
        foreach $f1 (@files1) {
            if ($f1 =~ m#/$pat#i) {
                print "$f1 <=> $f2\n";
            }
        }
    }'
Output:
dir1/And-Another File.txt <=> dir2/And another_file.txt
dir1/Test File Name.txt <=> dir2/test file_Name.txt
dir1/Test File Name.txt <=> dir2/test-File_Name.txt
dir1/This is anotherfile.txt <=> dir2/This Is another file.txt
[Explanations]
The concept is to generate a regex pattern on the fly from a filename in one directory and match it against the files in the other directory.
The file extension is replaced with a pattern that matches any extension.
Each non-alphanumeric character or underscore is replaced with a pattern that matches any such character, or none at all, so that anotherfile and another file match.
The i option on the match enables case-insensitive matching.
You can see the generated regex by uncommenting the noted line.
One limitation is that the pattern generated from the filename anotherfile cannot match another file. In other words, the matching is one-directional. A possible workaround is to ignore non-alphanumeric characters and underscores entirely when matching, but that may overmatch unexpectedly depending on the words and punctuation. To go further, we would need a precise definition of similarity.
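The workaround of ignoring non-alphanumeric characters entirely can be sketched in bash by reducing each name to a comparison key. This is an illustration under assumptions, not part of the Perl answer: it needs bash 4 or later for ${var,,}, and name_key is a hypothetical helper name.

```shell
#!/bin/bash
# name_key is a hypothetical helper: prints a lowercase key with the
# directory, the last extension and all non-alphanumerics removed.
name_key() {
    local name=${1##*/}   # remove directory name
    name=${name%.*}       # remove last extension
    name=${name,,}        # lowercase (bash 4+)
    printf '%s\n' "${name//[^[:alnum:]]/}"
}

# Two names count as similar when their keys are equal.
a=$(name_key "dir1/This is anotherfile.txt")
b=$(name_key "dir2/This Is another file.txt")
[[ $a == "$b" ]] && echo "match: $a"
```

Comparing keys is symmetric, so it avoids the one-directional problem, but it can overmatch: any two names that differ only in punctuation, spacing or case collapse to the same key.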
[Edit]
In order to get the result back to bash variables, please try:
while IFS= read -r -d "" line; do
    # do something with the bash variable "line"
    echo "$line"
done < <(
perl -e '
    @files1 = glob "dir1/*";
    @files2 = glob "dir2/*";
    foreach (@files2) {
        $f2 = $_;
        s#.*/##;                # remove directory name
        # s#\..*?$##;           # remove extension (wrong)
        s#\.[^.]*$##;           # remove extension (corrected)
        s#[\W_]#[\\W_]?#g;      # replace non-alphanumeric chars
        $pat = $_ . "\\.\\w+\$";
        # print $pat, "\n";     # uncomment to see the regex pattern
        foreach $f1 (@files1) {
            if ($f1 =~ m#/$pat#i) {
                push(@result, "$f1 <=> $f2");
                # if you want just the list of filenames, comment out the line above
                # and uncomment the line below
                #push(@result, $f1, $f2);
            }
        }
    }
    print join("\0", @result) . "\0";
')
The results are stored in the bash variable line, one entry per loop iteration.
If you want to tweak the output format, please modify the line push(@result, ...).
[EDIT]
Modified to work with the following filename pairs:
"Sample Filename.txt" <=> "Sample Filename (100).txt"
"Sample.Filename.txt" <=> "Sample Filename.txt"
Here's the updated code:
while IFS= read -r -d "" line; do
    # do something with the bash variable "line"
    echo "$line"
done < <(
perl -e '
    @files1 = glob "dir1/*";
    @files2 = glob "dir2/*";
    foreach (@files2) {
        $f2 = $_;
        s#.*/##;                # remove directory name
        s#\.[^.]*$##;           # remove extension
        s#\s*\(.*?\)##;         # remove parentheses if any
        s#\s*\[.*?\]##;         # remove square brackets if any
        s#[\W_]#[\\W_]?#g;      # replace non-alphanumeric chars
        $pat = $_ . "\\s?((\\(.*?\\))|(\\[.*?\\]))?" . "\\.\\w+\$";
        #print $pat . "\n";     # uncomment to see the regex pattern
        foreach $f1 (@files1) {
            if ($f1 =~ m#/$pat#i) {
                push(@result, "$f1 <=> $f2");
                # if you want just the list of filenames, comment out the line above
                # and uncomment the line below
                #push(@result, $f1, $f2);
            }
        }
    }
    print join("\0", @result) . "\0";
')

Split filename and get the element between first and last occurrence of underscore

I am trying to split many folder names in a for loop and extract the element between first and last underscore of filename. Filenames can look like ENCSR000AMA_HepG2_CTCF or ENCSR000ALA_endothelial_cell_of_umbilical_vein_CTCF.
My problem is that folder names differ from each other in the total number of underscores, so I cannot use something like:
IN=$d
folderIN=(${IN//_/ })
tf_name=${folderIN[-1]%/*} #get last element which is the TF name
cell_line=${folderIN[-2]%/*}; #get second last element which is the cell line
dataset_name=${folderIN[0]%/*}; #get first element which is the dataset name
cell_line can be one or more words separated by underscores, but it's always between the first and last underscore.
Any help?
Do this with a two-step parameter expansion; two steps are needed because bash, unlike zsh, does not support nested parameter expansion.
Use "${string%_*}" to strip everything from the last occurrence of '_' onward, and "${tempString#*_}" to strip everything from the beginning through the first occurrence of '_':
string="ENCSR000ALA_endothelial_cell_of_umbilical_vein_CTCF"
tempString="${string%_*}"
printf "%s\n" "${tempString#*_}"
endothelial_cell_of_umbilical_vein
Another example,
string="ENCSR000AMA_HepG2_CTCF"
tempString="${string%_*}"
printf "%s\n" "${tempString#*_}"
HepG2
You can modify this logic to apply on each of the file-names in your folder.
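Applying the two-step expansion to several names could look like the following. A minimal sketch, assuming bash; middle_part is a hypothetical helper name, and the two names are the examples from the question.

```shell
#!/bin/bash
# middle_part is a hypothetical helper: prints the text between the
# first and the last underscore of its argument.
middle_part() {
    local tmp=${1%_*}          # drop the last '_' and everything after it
    printf '%s\n' "${tmp#*_}"  # drop everything up to the first '_'
}

for name in ENCSR000AMA_HepG2_CTCF ENCSR000ALA_endothelial_cell_of_umbilical_vein_CTCF; do
    echo "$name -> $(middle_part "$name")"
done
```

Names with fewer than two underscores are not handled specially here; add a check if such names can occur.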
Could use regex.
extract_words() {
[[ "$1" =~ ^([^_]+)_(.*)_([^_]+)$ ]] && echo "${BASH_REMATCH[2]}"
}
while read -r from_line
do
extracted=$(extract_words "$from_line")
echo "$from_line" "[$extracted]"
done < list_of_filenames.txt
EDIT: I moved the "extraction" into a standalone bash function so it can be reused and easily modified for more complex cases, like:
extract_words() {
perl -lnE 'say $2 if /^([^_]+)_(.*)_([^_]+)$/' <<< "$1"
}

Rename files in a directory; new name depends on part of old name

I want to rename files in a directory whose names match certain suffixes, namely 810 or 814. I want the new names to include a string selected depending on the suffix (for 814, EB_ENROLL_REQ; for 810, EB_BCHG_REQ).
Examples of the input filenames (all in $source_dir) are:
CCRD_LLX_814_20160218043477.EDI810
CCRD_LLX_814_20160218043407.EDI814
helloworld
CCRD_LLX_814_20160218043487.EDI814
test123
files.txt
CCRD_LLX_814_20160218043467.EDI810
I want to read all files in the directory and rename only the files ending with 814 or 810, ignoring the rest.
I tried:
export search_dir=/home/test2
declare -a myArray
myArray[814]=EB_ENROLL_REQ
myArray[810]=EB_BCHG_REQ
for entry in "$search_dir"/*
do
pattern=${entry: -3}
#if ??
mv "$entry" "$entry.XHS.JOBRUNID.${myArray[$pattern]}.$entry.XHE"
done
but didn't get what I need.
The output filename for an 814 file should be, for example:
CCRD_LLX_814_20160218043487.EDI814.XHS.JOBRUNID.EB_ENROLL_REQ.CCRD_LLX_814_20160218043487.EDI814.XHE
Try this:
declare -A myArray # -A, not -a --- index by strings
myArray["814"]=EB_ENROLL_REQ # string suffixes, not numeric
myArray["810"]=EB_BCHG_REQ
cd "$search_dir" || exit # Otherwise you have to strip $search_dir out of $entry; stop if cd fails
for entry in *81[04] # Only work on the files that end with 810 or 814
do
    pattern=${entry: -3} # the string "810" or "814"
    mv "$entry" "$entry.XHS.JOBRUNID.${myArray[$pattern]}.$entry.XHE"
done
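To preview the new names before moving anything, the name construction can be separated from the mv. This is a minimal sketch, assuming bash; it swaps the associative array for a case statement, and build_name is a hypothetical helper name.

```shell
#!/bin/bash
# build_name is a hypothetical helper: prints the new name for a file
# ending in 810 or 814, and returns 1 for any other name.
build_name() {
    local entry=$1 tag
    case ${entry: -3} in
        814) tag=EB_ENROLL_REQ ;;
        810) tag=EB_BCHG_REQ ;;
        *)   return 1 ;;
    esac
    printf '%s\n' "$entry.XHS.JOBRUNID.$tag.$entry.XHE"
}

build_name "CCRD_LLX_814_20160218043487.EDI814"
build_name "helloworld" || echo "helloworld: skipped"
```

In a loop over all files, the non-zero return status can be used to skip names that should not be renamed.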

Resources