UNIX :: Padding for files containing string and multipleNumber


I have many files not having consistent filenames.
For example
IMG_20200823_1.jpg
IMG_20200823_10.jpg
IMG_20200823_12.jpg
IMG_20200823_9.jpg
I would like to rename all of them so that they all follow the same naming convention:
IMG_20200823_0001.jpg
IMG_20200823_0010.jpg
IMG_20200823_0012.jpg
IMG_20200823_0009.jpg
I found out that a name consisting only of a number can be padded with
printf "%04d\n"
However, I am not able to do this with my files, since their names mix a string, "_" and different numbers.
Could anyone help me?
Thanks!

With Perl's standalone rename or prename command:
rename -n 's/(\d+)(\.jpg$)/sprintf("%04d%s",$1,$2)/e' *.jpg
Output:
rename(IMG_20200823_10.jpg, IMG_20200823_0010.jpg)
rename(IMG_20200823_12.jpg, IMG_20200823_0012.jpg)
rename(IMG_20200823_1.jpg, IMG_20200823_0001.jpg)
rename(IMG_20200823_9.jpg, IMG_20200823_0009.jpg)
If everything looks fine, remove -n (the dry-run flag) to actually rename the files.

With Bash regular expressions:
re='(IMG_[[:digit:]]+)_([[:digit:]]+)'
for f in *.jpg; do
[[ $f =~ $re ]] || continue   # skip names that do not match
mv "$f" "$(printf '%s_%04d.jpg' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}")"
done
where BASH_REMATCH is an array containing the capture groups of the regular expression. At index 0 is the whole match; index 1 contains IMG_ and the first group of digits; index 2 contains the second group of digits. The printf command is used to format the second group with zero padding, four digits wide.
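The same idea can be wrapped in a small function that refuses names which do not match, so that a stray file never gets renamed from stale capture groups. This is only a sketch: the function name padded_name and the anchored pattern are assumptions made for illustration, and the loop only prints the mv commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the zero-padded name for one file,
# or return 1 if the name does not match the expected pattern.
padded_name() {
    local re='^(IMG_[[:digit:]]+)_([[:digit:]]+)\.jpg$'
    [[ $1 =~ $re ]] || return 1
    # 10# forces base 10, so a number such as 09 is not read as octal
    printf '%s_%04d.jpg\n' "${BASH_REMATCH[1]}" "$((10#${BASH_REMATCH[2]}))"
}

# Dry run over the current directory: only prints the mv commands
for f in *.jpg; do
    new=$(padded_name "$f") || continue
    [[ $f == "$new" ]] || echo mv -- "$f" "$new"
done
```

Drop the echo once the printed commands look right.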

Use a regex to extract the relevant substrings from the file name, then pad the number:
For each file:
Extract the prefix, number and suffix from the filename.
Pad the number with zeros.
Create the new filename.
Move the file.
The following bash code does this:
echo 'IMG_20200823_1.jpg
IMG_20200823_10.jpg
IMG_20200823_12.jpg
IMG_20200823_9.jpg' |
while IFS= read -r file; do # foreach file
# Use GNU sed to extract parts on separate lines
tmp=$(<<<"$file" sed 's/\(.*_\)\([0-9]*\)\(\..*\)/\1\n\2\n\3\n/')
# Read the separate parts separated by newlines
{
IFS= read -r prefix
IFS= read -r number
IFS= read -r suffix
} <<<"$tmp"
# create new filename
newfilename="$prefix$(printf "%04d" "$number")$suffix"
# move the files
echo mv "$file" "$newfilename"
done
outputs:
mv IMG_20200823_1.jpg IMG_20200823_0001.jpg
mv IMG_20200823_10.jpg IMG_20200823_0010.jpg
mv IMG_20200823_12.jpg IMG_20200823_0012.jpg
mv IMG_20200823_9.jpg IMG_20200823_0009.jpg
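If GNU sed is not available, the prefix, number and suffix can also be split with parameter expansion alone. A minimal sketch, assuming every name has the form prefix_number.suffix; the function name pad_last_number is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split "prefix_number.suffix" with parameter
# expansion only and print the name with the number padded to 4 digits.
pad_last_number() {
    local file=$1
    local suffix=.${file##*.}      # text after the last dot, e.g. .jpg
    local stem=${file%.*}          # name without the suffix
    local prefix=${stem%_*}_       # everything up to the last underscore
    local number=${stem##*_}       # digits after the last underscore
    # 10# forces base 10 in case the number already has leading zeros
    printf '%s%04d%s\n' "$prefix" "$((10#$number))" "$suffix"
}

pad_last_number IMG_20200823_12.jpg   # prints: IMG_20200823_0012.jpg
```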

Being puzzled by your hint at printf...
Current folder content:
$ ls -1 IMG_*
IMG_20200823_1.jpg
IMG_20200823_21.jpg
This is surely not a good solution, but it can be done with printf and sed:
$ printf "mv %3s_%8s_%d.%3s %3s_%8s_%04d.%3s\n" $(ls -1 IMG_* IMG_* | sed 's/_/ /g; s/\./ /')
mv IMG_20200823_1.jpg IMG_20200823_0001.jpg
mv IMG_20200823_21.jpg IMG_20200823_0021.jpg

Related

How to only concatenate files with same identifier using bash script?

I have a directory with files, some have the same ID, which is given in the first part of the file name before the first underscore (always). e.g.:
S100_R1.txt
S100_R2.txt
S111_1_R1.txt
S111_R1.txt
S111_R2.txt
S333_R1.txt
I want to concatenate the files with identical IDs (and, if possible, place the original files in another directory), e.g. output:
original files (folder)
S100_merged.txt
S111_merged.txt
S333_R1.txt
Small note: I imagine that perhaps a solution would be to place all files to be processed in a new directory and then, in a second step, move the files with the appended "merged" back to the original directory, or something like this...
I am extremely new to bash scripting, so I really can't produce this code. I am used to the R language and I can think how it should work but can't write it.
My pitiful attempt is something like this:
while IFS= read -r -d '' id; do
cat *"$id" > "./${id%.txt}_grouped.txt"
done < <(printf '%s\0' *.txt | cut -zd_ -f1- | sort -uz)
or this:
for ((k=100;k<400;k=k+1));
do
IDList= echo "S${k}_S*.txt" | awk -F'[_.]' '{$1}'
while [ IDList${k} == IDList${k+n} ]; do
cat IDList${k}_S*.txt IDList${k+n}_S*.txt S${k}_S*.txt S${k}_S*.txt >cat/S${k}_merged.txt &;
done
Sometimes there is only one version of the file (e.g. S333_R1.txt), sometimes two (S100*), three (S111*) or more.
I am prepared for harsh critique for this question because I am so far from a solution, but if someone would be willing to help me out I would greatly appreciate it!

while read -r line;
do
if [[ "$(find . -maxdepth 1 -name "${line}_*.txt" | wc -l)" -gt "1" ]]
then
cat "${line}"_*.txt >> "${line}_merged.txt"
fi
done <<< "$(for i in *_*.txt;do echo "$i";done | awk -F_ '{ print $1 }' | sort -u)"
List the files matching *_*.txt, pass the names to awk to print the part before the first "_", and remove duplicate prefixes with sort -u. Feed the result to a while loop. For each prefix, use find to check whether more than one file matches and, if so, cat those files into a merged file.

for id in $(ls | grep -Po '^[^_]+' | uniq) ; do
if [ $(ls ${id}_*.txt 2> /dev/null | wc -l) -gt 1 ] ; then
cat ${id}_*.txt > _${id}_merged.txt
mv ${id}_*.txt folder
fi
done
for f in _*_merged.txt ; do
mv ${f} ${f:1}
done

A plain bash loop with preprocessing:
# first get the list of files
find . -type f |
# then extract the prefix
sed 's#./\([^_]*\)_#\1\t&#' |
# then in a loop merge the files
while IFS=$'\t' read prefix file; do
cat "$file" >> "${prefix}_merged.txt"
done
That script is iterative: it processes one file at a time. To detect whether a prefix has only one file, we have to look at all files at once. So first use an awk script to join the filenames that share a common prefix:
find . -type f | # maybe `sort |` ?
# join filenames with common prefix
awk '{
f=$0; # remember the file path
gsub(/.*\//,"");gsub(/_.*/,""); # extract prefix from filepath and store it in $0
a[$0]=a[$0]" "f # Join path with leading space in associative array indexed with prefix
}
# Output prefix and filenames separated by spaces.
# TBH a tab would be a better separator..
END{for (i in a) print i a[i]}
' |
# Read input separated by spaces into a bash array
while IFS=' ' read -ra files; do
#first array element is the prefix
prefix=${files[0]}
unset files[0]
# rest is the files
case "${#files[@]}" in
0) echo super error; ;;
# one file - preserve the filename
1) cat "${files[@]}" > "$outdir"/"${files[1]}"; ;;
# more files - do a _merged.txt suffix
*) cat "${files[@]}" > "$outdir"/"${prefix}_merged.txt"; ;;
esac
done
Tested on repl.
IDList= echo "S${k}_S*.txt"
Executes the command echo with the environment variable IDList exported and set to empty with one argument equal to S<insert value of k here>_S*.txt.
Filename expansion (ie. * -> list of files) is not executed inside " double quotes.
To assign the result of a command to a variable, use command substitution: var=$( something something | something )
IDList${k+n}_S*.txt
${var+word} is a parameter expansion; it does not add two variables together. It expands to word when var is set and to nothing when var is unset. See shell parameter expansion and my answer on ${var-word}, which works similarly.
To add two numbers use arithmetic expansion $((k + n)).
awk -F'[_.]' '{$1}'
$1 on its own does nothing here. To print the first field, print it: {print $1}.
Remember to check your scripts with http://shellcheck.net
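The three points above can be tried out in isolation. A small sketch with made-up values for k and n:

```shell
#!/usr/bin/env bash
k=100
n=5

# ${k+n}: k is set, so this expands to the literal word "n"
echo "S${k+n}"        # prints: Sn

# $((k + n)): arithmetic expansion adds the two values
echo "S$((k + n))"    # prints: S105

# Command substitution captures the output of a pipeline
id=$(echo "S${k}_R1.txt" | awk -F'[_.]' '{print $1}')
echo "$id"            # prints: S100
```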

A pure bash way below. It uses only globs (no need for external commands like ls or find for this question) to enumerate filenames and an associative array (which is supported by bash since the version 4.0) in order to compute frequencies of ids. Parsing ls output to list files is questionable in bash. You may consider reading ParsingLs.
#!/bin/bash
backupdir=original_files # The directory to move the original files
declare -A count # Associative array to hold id counts
# If it is assumed that the backup directory exists prior to call, then
# drop the line below
mkdir "$backupdir" || exit
for file in [^_]*_*; do ((++count[${file%%_*}])); done
for id in "${!count[@]}"; do
if ((count[$id] > 1)); then
mv "$id"_* "$backupdir"
cat "$backupdir/$id"_* > "$id"_merged.txt
fi
done

Update numbers in filenames

I have a set of filenames which are ordered numerically like:
13B12363_1B1_0.png
13B12363_1B1_1.png
13B12363_1B1_2.png
13B12363_1B1_3.png
13B12363_1B1_4.png
13B12363_1B1_5.png
13B12363_1B1_6.png
13B12363_1B1_7.png
13B12363_1B1_8.png
13B12363_1B1_9.png
13B12363_1B1_10.png
[...]
13B12363_1B1_495.png
13B12363_1B1_496.png
13B12363_1B1_497.png
13B12363_1B1_498.png
13B12363_1B1_499.png
After some postprocessing, I removed some files and I would like to update the ordering number and replace the actual number by its new position. Looking at this previous question I end up doing something like:
(1) ls -v | cat -n | while read n f; do mv -i $f ${f%%[0-9]+.png}_$n.png; done
However, this command does not recognize the "ordering number + .png" part and just appends the new number at the end of the filename, giving something like 13B12363_1B1_10.png_9.png
On the other hand, if I do:
(2) ls -v * | cat -n | while read n f; do mv $f ${f%.*}_$n.png; done
The ordering number is added without issues. Like 13B12363_1B1_10_9.png
So, for (1) it seems I am not specifying the digit correctly but I am not able to find the correct syntax. So far I tried [0-9], [0-9]+, [[:digits:]] and [[:digits:]]+. Which should be the proper one?
Additionally, in (2) I am wondering how I should specify rename (CentOS version) to remove the numbers between the second and the third underscore. Here I have to say that I have some filenames like 20B12363_22_10_9.png, so I should somehow specify second and third underscore.

Using Bash's built-in regex matching (POSIX extended regular expressions) and a null delimited list of files.
Tested with sample
#!/usr/bin/env bash
prename=$1
# Bash setting to return empty result if no match found
shopt -s nullglob
# Create a temporary directory to prevent file rename collisions
tmpdir=$(mktemp -d) || exit 1
# Add a trap to remove the temporary directory on EXIT
trap 'rmdir -- "$tmpdir"' EXIT
# Initialize file counter
n=0
# Generate null delimited list of files
printf -- %s\\0 "${prename}_"*'.png' |
# Sort the null delimited list on 3rd field numeric order with _ separator
sort --zero-terminated --field-separator=_ --key=3n |
# Iterate the null delimited list
while IFS= read -r -d '' f; do
# If Bash Regex match the file name AND
# file has a different sequence number
if [[ "$f" =~ (.*)_([0-9]+)\.png$ ]] && [[ ${BASH_REMATCH[2]} -ne $n ]]; then
# Use captured Regex match group 1 to rename file with incrementing counter
# and move it to the temporary folder to prevent rename collision with
# existing file
echo mv -- "$f" "$tmpdir/${BASH_REMATCH[1]}_$((n)).png"
fi
# Increment file counter
n=$((n+1))
done
# Move back the renamed files in place
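# (the glob must stay outside the quotes, otherwise it is not expanded)
renamed=("$tmpdir"/*)
((${#renamed[@]})) && mv --no-clobber -- "${renamed[@]}" ./
# $tmpdir removal is automatic on EXIT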
# If something goes wrong, some files remain in it and it is not deleted
# so these can be dealt with manually
Remove the echo if the result matches your expectations.
Output from the sample
mv -- 13B12363_1B1_495.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_11.png
mv -- 13B12363_1B1_496.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_12.png
mv -- 13B12363_1B1_497.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_13.png
mv -- 13B12363_1B1_498.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_14.png
mv -- 13B12363_1B1_499.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_15.png

Do not parse ls.
read interprets \ and splits on IFS unless you use IFS= read -r. See the BashFAQ entry on how to read a stream line by line.
In the ${f%%pattern} expansion the pattern is not a regex but a glob. The rules differ: + means a literal +.
You could shopt -s extglob and then use ${f%%+([0-9]).png}. Or write a loop. Or match the _ too and do f=${f%%.png}; f="${f%_[0-9]*}_".
Or something along (untested):
find . -maxdepth 1 -mindepth 1 -type f -name '13B12363_1B1_*.png' |
sort -t_ -n -k3 |
sed 's/\(.*_\)[0-9][0-9]*\.png$/&\t\1/' |
{
n=1;
while IFS=$'\t' read -r from to; do
echo mv "$from" "$to$((n++)).png";
done;
}
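The difference between a plain glob and an extended glob in the ${f%%pattern} expansion can be checked on a single name. A minimal sketch using one of the file names from the question:

```shell
#!/usr/bin/env bash
shopt -s extglob   # enables +(...) and friends in patterns

f=13B12363_1B1_10.png

# Plain glob: "+" is a literal plus sign, so nothing matches and f is unchanged
echo "${f%%[0-9]+.png}"     # prints: 13B12363_1B1_10.png

# Extended glob: +([0-9]) means "one or more digits"
echo "${f%%+([0-9]).png}"   # prints: 13B12363_1B1_
```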

Another alternative, with perl:
perl -e 'while(<@ARGV>){$o=$_;s/\d+(?=\D*$)/$i++.".renamed"/e;die if -e $_;rename $o,$_}while(<*.renamed>){$o=$_;s/\.renamed$//;die if -e $_;rename $o,$_}' $(ls -v|sed -E "s/$|^/'/g"|paste -sd ' ' -)
This solution should avoid rename collisions by first renaming the files with an extra ".renamed" extension and then removing the ".renamed" extension as the last step. There are also checks to detect rename collisions.
Anyway, please back up your data before trying :)
The perl script unrolled and explained:
while(<@ARGV>){ # loop through arguments.
# filenames are passed to "$_" variable
# save old file name
$o=$_;
# if not using variable, regex replacement (s///) uses topic variable ($_)
# e flag ==> evals the replacement
s/\d+(?=\D*$)/$i++.".renamed"/e; # works on $_
# Detect rename collision
die if -e $_;
rename $o,$_
}
while(<*.renamed>){
$o=$_;
s/\.renamed$//; # remove .renamed extension
die if -e $_;
rename $o,$_
}
The regex:
\d+ # one or more digits
(?=\D*$) # followed by 0 or more non-digits and end of string
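The two-pass idea (temporary extension first, final name second) is not specific to Perl. A rough bash sketch, assuming names of the form prefix_N.png and a caller that passes the files already in the desired order; the function name renumber is made up for illustration, and the advice to back up the data first applies here too.

```shell
#!/usr/bin/env bash
# Hypothetical helper: renumber the trailing number of the given files
# (in the order given) in two passes to avoid rename collisions.
renumber() {
    local f tmp i=0
    local -a staged=()
    # Pass 1: give every file a temporary name with a .renamed extension
    for f in "$@"; do
        tmp="${f%_*}_$((i++)).png.renamed"
        [[ -e $tmp ]] && { echo "collision: $tmp" >&2; return 1; }
        mv -- "$f" "$tmp" || return 1
        staged+=("$tmp")
    done
    # Pass 2: drop the temporary extension again
    for tmp in "${staged[@]}"; do
        f=${tmp%.renamed}
        [[ -e $f ]] && { echo "collision: $f" >&2; return 1; }
        mv -- "$tmp" "$f" || return 1
    done
}

# Usage (not executed here): renumber a_0.png a_2.png a_5.png
```

With the three files from the usage comment, the result would be a_0.png, a_1.png and a_2.png.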

Pipe the output of basename to string substitution

I need the basename of a file that is given as an argument to a bash script. The basename should be stripped of its file extension.
Let's assume $1 = "/somefolder/andanotherfolder/myfile.txt", the desired output would be "myfile".
The current attempt creates an intermediate variable that I would like to avoid:
BASE=$(basename "$1")
NOEXT="${BASE%.*}"
My attempt to make this a one-liner would be piping the output of basename. However, I do not know how to pipe stdout to a string substitution.
EDIT: this needs to work for multiple file extensions with possibly differing lengths, hence the string substitution attempt as given above.

Why not Zoidberg ?
Ehhmm.. I meant why not remove the ext before going for basename ?
basename "${1%.*}"
Unless of course you have directory paths with dots, then you'll have to use basename before and remove the extension later:
echo $(basename "$1") | awk 'BEGIN { FS = "." }; { print $1 }'
The awk solution will remove anything after the first dot from the filename.
There's a regular expression based solution which uses sed to remove only the extension after last dot if it exists:
echo $(basename "$1") | sed 's/\(.*\)\..*/\1/'
This could even be improved if you're sure that you've got alphanumeric extensions of 3-4 characters (e.g. mp3, mpeg, jpg, txt, json...):
echo $(basename "$1") | sed 's/\(.*\)\.[[:alnum:]]\{3,4\}$/\1/'

How about this?
NEXT="$(basename -- "${1%.*}")"
Testing:
set -- '/somefolder/andanotherfolder/myfile.txt'
NEXT="$(basename -- "${1%.*}")"
echo "$NEXT"
myfile
Alternatively:
set -- "${1%.*}"; NEXT="${1##*/}"

NOEXT="${1##*/}"; NOEXT="${NOEXT%.*}"
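The two expansions can be wrapped in a function so that no intermediate variable leaks into the calling script. A minimal sketch; the name stem is an assumption, not something from the question. Because the directory part is stripped first, dots in directory names do no harm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: basename without the last extension, using only
# parameter expansion (no external commands, no pipe).
stem() {
    local name=${1##*/}          # strip the directory part first
    printf '%s\n' "${name%.*}"   # then strip the last ".ext", if any
}

stem /somefolder/andanotherfolder/myfile.txt   # prints: myfile
stem /some.folder/myfile                       # prints: myfile
stem /somefolder/archive.tar.gz                # prints: archive.tar
```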

How about:
$ [[ $var =~ [^/]*$ ]] && echo ${BASH_REMATCH%.*}
myfile

How can I save only a substring of file names from a directory without the file extension?

I have a directory that I'm reading from and I want to save only the date representation as a string.
I am close to getting it , although I know there is probably an easier way. Here is what I have so far:
#files are in the format of "THIS_20200420.csv" so I want only "20200420"
declare -a arr
declare -a arr2
FILES=test2/*.csv
for file in $FILES
do
arr=(${arr[*]} "${file##*/}")
done
for i in "${arr[@]}"
do
arr2+=$(echo $i | cut -c6-13)
done
for item in "${arr2[@]}"
do
echo $item
done
the output shows the array only having one element which is all the strings concatenated:
20200110202001202020021920200220202004202020042220200110202001202020021920200220202004202020042220200219202002202020042020200422
I'm bashing my head against my computer at this point.

arr=(
"THIS_20200420.csv"
"THIS_20200421.csv"
"THIS_20200422.csv"
"THIS_20200423.csv"
"THIS_20200424.csv"
"THIS_20200425.csv"
"THIS_20200426.csv"
"THIS_20200427.csv"
"THIS_20200428.csv"
"THIS_20200429.csv"
"THIS_20200430.csv" )
arr=( ${arr[@]//*_} )
arr=( ${arr[@]//.*} )
echo "arr: ${arr[@]}"
Explanation:
arr=( ${arr[@]//*_} ) matches all characters up to and including '_' in each element and replaces them with the empty string.
arr=( ${arr[@]//.*} ) matches the '.' and all characters after it in each element and replaces them with the empty string.
For more information on parameter expansion, a good reference is TLDP's guide on parameter expansion.

Try this
declare -a arrayname=($(ls -1 test2/*.csv | grep -o '[0-9]*'))
Demo:
$ls -1 *csv
THIS_20200420.csv
THIS_20200421.csv
THIS_20200422.csv
THIS_20200423.csv
THIS_20200424.csv
THIS_20200425.csv
THIS_20200426.csv
THIS_20200427.csv
THIS_20200428.csv
THIS_20200429.csv
THIS_20200430.csv
$declare -a arrayname=($(ls -1 *csv | grep -o '[0-9]*'))
$echo ${arrayname[@]}
20200420 20200421 20200422 20200423 20200424 20200425 20200426 20200427 20200428 20200429 20200430
$echo ${arrayname[2]}
20200422
$

You could achieve this using a loop with awk:
$ for file in *.csv; do echo $file | awk -F '[^[:alnum:]]' '{print $2}'; done
The -F '[^[:alnum:]]' tells awk to use non alphanumeric characters as the delimiter.
Another way to do this is to use bash shell parameter expansion to echo only the part of the filename you want. This obviously only works if your filenames have consistent formatting:
$ for file in *.csv; do echo "${file:5:8}"; done
I thought it would be nice to use bash parameter expansion to strip the unwanted prefix and suffix but you can't have nested expansion (afaict) so this is the best I could come up with:
$ for file in *.csv; do echo "$(tmp=${file%.csv}; echo ${tmp#THIS_})"; done

Meet Cut! A good friend of Linux Users
for file in ./*.csv; do echo $file | cut -d "_" -f 2 | cut -d "." -f 1 ; done
This one line should do the trick!

Use an array for the files assignment and parameter expansion.
#!/usr/bin/env bash
shopt -s nullglob
##: Save the files ending in .csv in an array so the glob expands
## properly; a plain variable assignment does not expand the glob *
files=(test2/*.csv)
##: Keep only the file names, removing the leading path (longest match)
files=("${files[@]##*/}")
##: Remove the .csv extension
files=("${files[@]%.csv}")
##: Remove everything up to the first _ from the beginning (shortest match)
files=("${files[@]#*_}")
printf '%s ' "${files[@]}"
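For a loop-based version closer to the code in the question, note that arr2+=$(...) without parentheses appends to the string in the first element, which is why all dates ended up concatenated; arr2+=("...") appends a new element. A minimal sketch, where the function name collect_dates and the global array dates are assumptions made for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill the global array "dates" with the part of each
# name between the last "_" and the ".csv" extension.
collect_dates() {
    dates=()
    local f
    for f in "$@"; do
        f=${f##*/}            # drop the directory
        f=${f%.csv}           # drop the extension
        dates+=("${f##*_}")   # parentheses append a new array element
    done
}

collect_dates test2/THIS_20200420.csv test2/THIS_20200421.csv
printf '%s\n' "${dates[@]}"
```

In a real script the call would be collect_dates test2/*.csv.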

Batch rename multiple numbers in filename with different padding

I am trying to rename a batch of files of the form:
test1_run1
test1_run2
...
test1_run10
...
test10_run1
test10_run2
...
test10_run10
to the form with multiple paddings. For the first number I need padding with zeros of length 5 and for the second with length 3.
The final result should be of the form:
test00001_run001
test00001_run002
...
test00001_run010
...
test00010_run001
test00010_run002
...
test00010_run010
How can I do this in bash for all the files in a particular folder?

We can convert the string into test + 5 digits + _run + 3 digits formats by saying:
$ awk -F"test" '{split($2,a,"_run"); printf "%s%0.5d%s%0.3d\n", FS, a[1], "_run", a[2]}' a
test00001_run001
test00001_run002
test00001_run010
test00010_run001
test00010_run002
test00010_run010
This works by using test as field separator and splitting the 2nd field in two parts: before and after _run. Then, it uses the printf thingies to get the proper output.
Then, you can print mv together with the previous value and say:
$ awk -F"test" '{split($2,a,"_run"); printf "mv %s %s%0.5d%s%0.3d\n", $0, FS, a[1], "_run", a[2]}' a
mv test1_run1 test00001_run001
mv test1_run2 test00001_run002
mv test1_run10 test00001_run010
mv test10_run1 test00010_run001
mv test10_run2 test00010_run002
mv test10_run10 test00010_run010
If you then pipe it to sh, it will get executed.

If you don't want to use perl or awk, and strictly use bash and some utility programs that are available in most distribution, you can try something like this:
for i in * ; do
testpart=$(echo "$i" | cut -d_ -f1)
testnum=${testpart#test}
runpart=$(echo "$i" | cut -d_ -f2)
runnum=${runpart#run}
destfile=test$(printf %05d "$testnum")_run$(printf %03d "$runnum")
mv -- "$i" "$destfile"
done

In bash:
#!/bin/bash
shopt -s nullglob extglob
for file in test+([[:digit:]])_run+([[:digit:]]); do
[[ $file =~ ^test([[:digit:]]+)_run([[:digit:]]+)$ ]]
printf -v newfile 'test%05d_run%03d' "$((10#${BASH_REMATCH[1]}))" "$((10#${BASH_REMATCH[2]}))"
echo mv "$file" "$newfile"
done
Run this from within the folder you want to process. This will only echo the mv commands to be performed. Remove the echo if you're happy with the result.
we're using the shell option nullglob so that non-matching globs expand to nothing;
we're using the shell option extglob because the for loop will use extended globs;
the extended glob test+([[:digit:]])_run+([[:digit:]]) will expand to the files matching this pattern (if any)
we're using a regex to get the digits from the file names; the first number will be in BASH_REMATCH[1] and the second in BASH_REMATCH[2].
we're using printf to format the new file name; the modifiers %05d and %03d will format the numbers according to your wishes (with appropriate leading zeroes). Observe that we're using $((10#${BASH_REMATCH[1]})) to explicitly specify that the number is in radix 10, in case you have a file test09_run001. The leading 0 in 09 would otherwise make bash interpret the number in radix 8, and you'd get a complaint;
the -v switch tells printf not to write to standard output, but to store the result in the variable newfile;
finally we perform the mv.
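The effect of the 10# prefix is easiest to see on a name that already carries leading zeros. A small sketch of the formatting step on its own; the function name pad_test_run is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the padded name for one "testN_runM" file,
# or return 1 if the name does not have that form.
pad_test_run() {
    local re='^test([[:digit:]]+)_run([[:digit:]]+)$'
    [[ $1 =~ $re ]] || return 1
    # 10# forces base 10; without it, 09 would be rejected as invalid octal
    printf 'test%05d_run%03d\n' "$((10#${BASH_REMATCH[1]}))" "$((10#${BASH_REMATCH[2]}))"
}

pad_test_run test1_run10     # prints: test00001_run010
pad_test_run test09_run008   # prints: test00009_run008
```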
