Rename files matching a pattern in a loop - Bash

I have been trying to rename some specific files based on a table, but with no success. It either renames all files or gives an error.
The directory contains hundreds of files named with long barcodes, and I want to rename only the files containing the pattern _1_.
Example
barcode_1_barcode_SL484171.fastq.gz
barcode_2_barcode_SL484171.fastq.gz
barcode_1_barcode_SL484370.fastq.gz
barcode_2_barcode_SL484370.fastq.gz
mytable.txt
oldname                      newname
barcode_1_barcode_SL484171   Description1
barcode_2_barcode_SL484171   Description1
barcode_1_barcode_SL484370   Description2
barcode_2_barcode_SL484370   Description2
Desired output:
Description1.R1.fastq.gz
Description2.R1.fastq.gz
As you can see in the table, there are two files per description, but I only want to rename the ones with the _1_ pattern.
Code I have tried:
for i in *_1_*.fastq.gz; do read oldname newname; mv "$oldname" "$newname".R1.fastq.gz; done < mytable.txt
for i in $(grep '_1_' mytable.txt); do read -r oldname newname; mv ${oldname} ${newname}.R1.fastq.gz; done < mytable.txt
for i in $(grep '_1_' mytable.txt); do oldname=$(cut -f1 $i);newname=$(cut -f2 $i); ln -s ${oldname} ${newname}.R1.fastq.gz; done

while read -r oldname newname
do
if [[ $oldname =~ "_1_" ]]
then
mv $oldname $newname
fi
done < mytable.txt

Something like this.
#!/usr/bin/env bash
while IFS= read -r files; do ##: loop through the output of `find . -name 'barcode_1_barcode*.gz' | grep -f <(cut -d' ' -f1 table.txt)`
  while read -ru9 old_name prefix; do ##: loop through the output of `grep 'barcode_1_barcode.*' table.txt`
    if [[ $files == *"$old_name"* ]]; then ##: If the filename from the output of find matches the first field of table.txt (space delimited)
      old_filename="${files%.fastq.gz}" ##: Extract the filename without the .fastq.gz extension
      extension="${files#"$old_filename"}" ##: Extract the .fastq.gz extension without the filename
      # mv -v "$files" "$prefix.R1${extension}"
      printf '%s %s %s ==> %s\n' mv -v "$files" "$prefix.R1${extension}" ##: Show how the files would be renamed (dry run)
    fi
  done 9< <(grep 'barcode_1_barcode.*' table.txt)
done < <(find . -name 'barcode_1_barcode*.gz' | grep -f <(cut -d' ' -f1 table.txt) ) ##: Retain only the files matching the first column/field of table.txt
Output from the OP's sample data/files.
renamed './barcode_1_barcode_SL484370.fastq.gz' -> 'Description2.R1.fastq.gz'
renamed './barcode_1_barcode_SL484171.fastq.gz' -> 'Description1.R1.fastq.gz'
If you're satisfied with the output, either move the # from the front of mv to the front of printf, or delete the printf line entirely and remove the # from mv, so that mv actually renames the files.
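Alternatively, a minimal sketch that reads mytable.txt directly, assuming two whitespace-separated columns (oldname, newname), that the files sit in the current directory, and that only rows whose oldname contains _1_ should be acted on; the echo keeps it a dry run:
#!/usr/bin/env bash
while read -r oldname newname; do
  [[ $oldname == *_1_* ]] || continue                     ##: only the _1_ rows
  [[ -e $oldname.fastq.gz ]] || continue                  ##: skip the header line and missing files
  echo mv -v "$oldname.fastq.gz" "$newname.R1.fastq.gz"   ##: drop echo to actually rename
done < mytable.txt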

Related

How can I sort an array based on a non-integer substring in Bash?

I wrote a cleanup script to delete certain files. The files are stored in subfolders. I use find to collect those files into an array, and it is recursive because of find. So an array entry could look like this:
(path to file)
./2021_11_08_17_28_45_1733556/2021_11_12_04_15_51_1733556_0.jfr
As you can see, the filenames are timestamps. find sorts by the folder name only (./2021_11_08_17_28_45_1733556), but I need to sort all the files, which can be in different folders, by the timestamp of the files only and not of the folders (those can be ignored completely), so I can delete the oldest files first. Below is my script in its not-properly-working state; I need to add some sorting to fix my problems.
Any ideas?
#!/bin/bash
# handle -h (help)
if [[ "$1" == "-h" || "$1" == "" ]]; then
echo -e '-p [path to target folder] \n-f [number of files that should remain in the folder] \n-d [false to disable dryRun]'
exit 0
fi
# handle parameters
while getopts p:f:d: flag
do
case "${flag}" in
p) pathToFolder=${OPTARG};;
f) maxFiles=${OPTARG};;
d) dryRun=${OPTARG};;
*) echo -e '-p [path to target folder] \n-f [number of files that should remain in the folder] \n-d [false to disable dryRun]'
esac
done
if [[ -z $dryRun ]]; then
dryRun=true
fi
# fill array with the .jfr files, sorted so that the oldest files get deleted first
fillarray() {
files=($(find -name "*.jfr" -type f))
totalFiles=${#files[@]}
}
# Return size of file
getfilesize() {
filesize=$(du -k "$1" | cut -f1)
}
count=0
checkfiles() {
# Check if File matches the maxFiles parameter
if [[ ${#files[@]} -gt $maxFiles ]]; then
# Check if dryRun is enabled
if [[ $dryRun == "false" ]]; then
echo "msg=\"Removal result\", result=true, file=$(realpath $1) filesize=$(getfilesize $1), reason=\"outside max file boundary\""
((count++))
rm $1
else
((count++))
echo msg="\"Removal result\", result=true, file=$(realpath $1 ) filesize=$(getfilesize $1), reason=\"outside max file boundary\""
fi
# Remove the file from the files array
files=(${files[@]/$1})
else
echo msg="\"Removal result\", result=false, file=$( realpath $1), reason=\"within max file boundary\""
fi
}
# Scan for empty files
scanfornullfiles() {
for file in "${files[#]}"
do
filesize=$(! getfilesize $file)
if [[ $filesize == 0 ]]; then
files=(${files[@]/$file})
echo msg="\"Removal result\", result=false, file=$(realpath $file), reason=\"empty file\""
fi
done
}
echo msg="jfrcleanup.sh started", maxFiles=$maxFiles, dryRun=$dryRun, directory=$pathToFolder
{
cd $pathToFolder > /dev/null 2>&1
} || {
echo msg="no permission in directory"
echo msg="jfrcleanup.sh stopped"
exit 0
}
fillarray #> /dev/null 2>&1
scanfornullfiles
for file in "${files[#]}"
do
checkfiles $file
done
echo msg="\"jfrcleanup.sh finished\", totalFileCount=$totalFiles filesRemoved=$count"
Assuming the file paths do not contain newline characters, would you please try
the following Schwartzian transform method:
#!/bin/bash
pat="/([0-9]{4}(_[0-9]{2}){5})[^/]*\.jfr$"
while IFS= read -r -d "" path; do
if [[ $path =~ $pat ]]; then
printf "%s\t%s\n" "${BASH_REMATCH[1]}" "$path"
fi
done < <(find . -type f -name "*.jfr" -print0) | sort -k1,1 | head -n 1 | cut -f2- | tr "\n" "\0" | xargs -0 echo rm
The string pat is a regex pattern to extract the timestamp from the
filename such as 2021_11_12_04_15_51.
Then the timestamp is prepended to the filename delimited by a tab
character.
The output lines are sorted by the timestamp in ascending order
(oldest first).
head -n 1 picks the oldest line. If you want to change the number of files
to remove, adjust the number passed to the -n option.
cut -f2- drops the timestamp to retrieve the filename.
tr "\n" "\0" protects the filenames which contain whitespaces or
tab characters.
xargs -0 echo rm just outputs the command lines as a dry run.
If the output looks good, drop echo.
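For instance, with the sample path from the question, the line fed to sort would be (the extracted timestamp and the path separated by a tab):
2021_11_12_04_15_51	./2021_11_08_17_28_45_1733556/2021_11_12_04_15_51_1733556_0.jfr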
If you have GNU find, and pathnames don't contain new-line ('\n') and tab ('\t') characters, the output of this command will be ordered by basenames:
find path/to/dir -type f -printf '%f\t%p\n' | sort | cut -f2-
TL;DR, but since you're using find, and if it supports the -printf option, something like:
find . -type f -name '*.jfr' -printf '%f/%h/%f\n' | sort -k1 -n | cut -d '/' -f2-
Otherwise a while read loop with another -printf option.
#!/usr/bin/env bash
while IFS='/' read -rd '' time file; do
  printf '%s\n' "$file"
done < <(find . -type f -name '*.jfr' -printf '%T@/%p\0' | sort -zn)
Note that -printf from find and the -z flag from sort are GNU extensions.
To save the file names, you could change
printf '%s\n' "$file"
to something like
files+=("$file")
to collect them in an array named files. Then "${files[@]}" has the file names as elements.
The last snippet with the while read loop does not depend on the file names, only on the timestamp reported by GNU find.
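Putting those pieces together, a minimal sketch that collects the paths oldest-first and removes everything beyond the newest N; maxFiles here is a hypothetical stand-in for the value the original script reads via getopts, and echo keeps it a dry run:
#!/usr/bin/env bash
maxFiles=10   ##: hypothetical value; take it from getopts as in the original script
files=()
while IFS='/' read -rd '' _ file; do
  files+=("$file")
done < <(find . -type f -name '*.jfr' -printf '%T@/%p\0' | sort -zn)   ##: oldest first
total=${#files[@]}
if (( total > maxFiles )); then
  for file in "${files[@]:0:total-maxFiles}"; do   ##: everything except the newest maxFiles
    echo rm -- "$file"   ##: drop echo to actually delete
  done
fi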
I solved the problem! I sort the array with the following so the oldest files get deleted first (sort -t/ -k3 sorts on the third /-separated field, i.e. the filename, given paths of the form ./folder/file.jfr):
files=($(printf '%s\n' "${files[@]}" | sort -t/ -k3))

Shell: Add a string to the end of each line which matches the pattern. Filenames are given in another file

I'm still new to the shell and need some help.
I have a file stapel_old.
Also I have in the same directory files like english_old_sync, math_old_sync and vocabulary_old_sync.
The content of stapel_old is:
english
math
vocabulary
The content of e.g. english is:
basic_grammar.md
spelling.md
orthography.md
I want to manipulate all files which are given in stapel_old like in this example:
take the first line of stapel_old 'english', (after that math, and so on)
convert in this case english to english_old_sync, (or after that what is given in second line, e.g. math to math_old_sync)
search in english_old_sync line by line for the pattern '.md'
And append to each line after .md :::#a1
The result should be e.g. of english_old_sync:
basic_grammar.md:::#a1
spelling.md:::#a1
orthography.md:::#a1
of math_old_sync:
geometry.md:::#a1
fractions.md:::#a1
and so on. stapel_old should stay unchanged.
How can I realize that?
I tried with sed -n and a while loop (while read -r line), and I feel it's somehow the right way, but after 4 hours of inspecting and reading I still get errors and not the expected result.
Thank you!
EDIT
Here is the working code (The files are stored in folder 'olddata'):
clear
echo -e "$(tput setaf 1)$(tput setab 7)Learning directories:$(tput sgr 0)\n"
# put here directories which should not become flashcards, command: | grep -v 'name_of_directory_which_not_to_learn1' | grep -v 'directory2'
ls ../ | grep -v 00_gliederungsverweise | grep -v 0_weiter | grep -v bibliothek | grep -v notizen | grep -v Obsidian | grep -v z_nicht_uni | tee olddata/stapel_old
# count folders
echo -ne "\nHow much different folders: " && wc -l olddata/stapel_old | cut -d' ' -f1 | tee -a olddata/stapel_old
echo -e "Are this learning directories correct? [j ODER y]--> yes; [Other]-->no\n"
read lernvz_korrekt
if [ "$lernvz_korrekt" = j ] || [ "$lernvz_korrekt" = y ];
then
read -n 1 -s -r -p "Learning directories correct. Press any key to continue..."
else
read -n 1 -s -r -p "Learning directories not correct, please change in line 4. Press any key to continue..."
exit
fi
echo -e "\n_____________________________\n$(tput setaf 6)$(tput setab 5)Found cards:$(tput sgr 0)$(tput setaf 6)\n"
#GET && WRITE FOLDER NAMES into olddata/stapel_old
anzahl_zeilen=$(cat olddata/stapel_old |& tail -1)
#GET NAMES of .md files of every stapel and write All to 'stapelname'_old_sync
i=0
name="var_$i"
for (( num=1; num <= $anzahl_zeilen; num++ ))
do
i="$((i + 1))"
name="var_$i"
name=$(cat olddata/stapel_old | sed -n "$num"p)
find ../$name/ -name '*.md' | grep -v trash | grep -v Obsidian | rev | cut -d'/' -f1 | rev | tee olddata/$name"_old_sync"
done
(tput sgr 0)
I tried to add:
input="olddata/stapel_old"
while IFS= read -r line
do
sed -n "$line"p olddata/stapel_old
done < "$input"
The code to change only the english_old_sync is:
lines=$(wc -l olddata/english_old_sync | cut -d' ' -f1)
for ((num=1; num <= $lines; num++))
do
content=$(sed -n "$num"p olddata/english_old_sync)
sed -i "s/"$content"/""$content":::#a1/g"" olddata/english_old_sync
done
So now, this needs to be an inner for-loop of an outer for-loop which holds the variable for english, right?
stapel_old should stay unchanged.
You could try a while + read loop and embed sed inside the loop.
#!/usr/bin/env bash
while IFS= read -r files; do
  echo cp -v "$files" "${files}_old_sync" &&
    echo sed '/^.*\.md$/s/$/:::#a1/' "${files}_old_sync"
done < olddata/stapel_old
convert in this case english to english_old_sync, (or after that what is given in second line, e.g. math to math_old_sync)
cp copies the file to a new name; if the goal is renaming the original files listed in stapel_old, then change cp to mv.
The -n and -i flags for sed were omitted; include them if needed.
The script also assumes that there are no empty/blank lines in stapel_old. If there are, add an additional test after the line with do:
[[ -n $files ]] || continue
It also assumes that the entries in stapel_old are existing files. Just in case, add an additional test:
[[ -e $files ]] || { printf >&2 '%s no such file or directory.\n' "$files"; continue; }
Or an if statement.
if [[ ! -e $files ]]; then
  printf >&2 '%s no such file or directory\n' "$files"
  continue
fi
See also help test
See also help continue
Combining them all together should be something like:
#!/usr/bin/env bash
while IFS= read -r files; do
  [[ -n $files ]] || continue
  [[ -e $files ]] || {
    printf >&2 '%s no such file or directory.\n' "$files"
    continue
  }
  echo cp -v "$files" "${files}_old_sync" &&
    echo sed '/^.*\.md$/s/$/:::#a1/' "${files}_old_sync"
done < olddata/stapel_old
Remove the echos if you're satisfied with the output, so the script actually copies/renames and edits the files.
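If you do want the changes applied, an in-place variant with the echos removed and sed -i added (a sketch, assuming GNU sed and that the *_old_sync copies are the files to edit):
#!/usr/bin/env bash
while IFS= read -r files; do
  [[ -n $files && -e $files ]] || continue          ##: skip blank lines and missing files
  cp -v -- "$files" "${files}_old_sync"             ##: or mv, if renaming is the goal
  sed -i '/\.md$/s/$/:::#a1/' "${files}_old_sync"   ##: append :::#a1 to lines ending in .md
done < olddata/stapel_old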

Automatically rename fasta files with the ID of the first sequence in each file

I have multiple fasta files, each with a single sequence, in the same directory. I want to rename each fasta file with the header of the single sequence present in the file. When I run my code, I get "Substitution pattern not terminated at (user-supplied code)".
my code:
#!/bin/bash
for i in /home/maryem/files/;
do
if [ ! -f $i ]; then
echo "skipping $i";
else
newname=`head -1 $i | sed 's/^\s*\([a-zA-Z0-9]\+\).*$/\1/'`;
[ -n "$newname" ] ;
mv -i $i $newname.fasta || echo "error at: $i";
fi;
done | rename s/ // *.fasta
fasta file:
>NC_013361.1 Escherichia coli O26:H11 str. 11368 DNA, complete genome
AGCTTTTCATTCTGACTGCAATGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTCTGACAGCAGCTTCTGAACTG
GTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGAC
AGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTATCACCACCATCACCATTACCACAGGT
Is there another way to rename each file with the ID in the header?
Given that the ID is the first "word" of the file, you can run the following in the directory containing the fasta files.
for f in *.fasta; do
  d="$(head -1 "$f" | awk '{print $1}').fasta"
  if [ ! -f "$d" ]; then
    mv "$f" "$d"
  else
    echo "File '$d' already exists! Skipped '$f'"
  fi
done
Credit: https://unix.stackexchange.com/a/13161
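Note that a fasta header line starts with >, so the first word of the header would carry that character into the new filename; a small sketch that strips it first (assuming the files end in .fasta):
#!/usr/bin/env bash
for f in *.fasta; do
  id=$(head -n 1 "$f" | sed 's/^>//' | awk '{print $1}')   ##: first word of the header, without ">"
  [ -n "$id" ] || { echo "no header ID in '$f', skipping"; continue; }
  if [ -e "$id.fasta" ]; then
    echo "File '$id.fasta' already exists! Skipped '$f'"
  else
    mv -i -- "$f" "$id.fasta"
  fi
done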

Replacing the duplicate uuids across multiple files

I am trying to replace the duplicate UUIDs from multiple files in a directory. Even the same file can have duplicate UUIDs.
I am using Unix utilities to solve this.
So far I have used grep, cut, sort and uniq to find all the duplicate UUIDs across the folder and store them in a file (say duplicate_uuids).
Then I tried sed to replace the UUIDs by looping through the file.
filename="$1"
re="*.java"
while read line; do
uuid=$(uuidgen)
sed -i'.original' -e "s/$line/$uuid/g" *.java
done < "$filename"
As you would expect, I ended up replacing all the duplicate UUIDs with a new UUID, but it is still duplicated throughout the file!
Is there any sed trick that can work for me?
There are a bunch of ways this can likely be done. Taking a multi-command approach using a function might give you greater flexibility if you want to customize things later, for example:
#!/bin/bash
checkdupes() {
  files="$*"
  for f in $files; do
    filename="$f"
    printf "Searching File: %s\n" "${filename}"
    while read -r line; do
      arr=( $(grep -n "${line}" "${filename}" | awk 'BEGIN { FS = ":" } ; {print $1" "}') )
      for i in "${arr[@]:1}"; do
        sed -i '' ''"${i}"'s/'"${line}"'/'"$(uuidgen)"'/g' "${filename}"
        printf "Replaced UUID [%s] at line %s, first found on line %s\n" "${line}" "${i}" "${arr[0]}"
      done
    done < <( sort "${filename}" | uniq -d )
  done
}
checkdupes /path/to/*.java
So what this series of commands does is first sort the duplicates (if any) in whatever file you choose. It takes those duplicates and uses grep and awk to create an array of the line numbers at which each duplicate is found. Looping through the array (while skipping the first value) allows the duplicates to be replaced by a new UUID and the file re-saved.
Using a duplicate list file:
If you want to use a file with a list of dupes to search other files and replace the UUID in each of them that match it's just a matter of changing two lines:
Replace:
for i in "${arr[#]:1}"; do
With:
for i in "${arr[#]}"; do
Replace:
done< <( sort "${filename}" | uniq -d )
With:
done< <( cat /path/to/dupes_list )
NOTE: If you don't want to overwrite the file, remove the -i '' at the beginning of the sed command.
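A more compact sketch of the same idea, assuming GNU sed and a duplicate_uuids list file (one UUID per line); each remaining occurrence of a listed UUID gets its own freshly generated replacement:
#!/usr/bin/env bash
while IFS= read -r dup; do
  for f in *.java; do
    while grep -q "$dup" "$f"; do
      new=$(uuidgen)
      sed -i "0,/$dup/s//$new/" "$f"   ##: GNU sed: 0,/re/ addresses only the first remaining match
    done
  done
done < duplicate_uuids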
This worked for me:
#!/bin/bash
duplicate_uuid=$1
# store file names in array
find . -name "*.java" > file_names
IFS=$'\n' read -d '' -r -a file_list < file_names
# store file duplicate uuids from file to array
IFS=$'\n' read -d '' -r -a dup_uuids < $duplicate_uuid
# loop through all files
for file in "${file_list[#]}"
do
echo "$file"
# Loop through all repeated uuids
for old_uuid in "${dup_uuids[#]}"
do
START=1
# Get the number of times uuid present in this file
END=$(grep -c $old_uuid $file)
if (( $END > 0 )) ; then
echo " Replacing $old_uuid"
fi
# Loop through them one by one and change the uuid
for (( c=$START; c<=$END; c++ ))
do
uuid=$(uuidgen)
echo " [$c of $END] with $uuid"
sed -i '.original' -e "1,/$old_uuid/s/$old_uuid/$uuid/" $file
done
done
rm $file.original
done
rm file_names

Renaming files using their content

I have several files which all start with this line:
CREATE PROCEDURE **CHANGING_NAME**
I want to be able to pull the name of the procedure and use it to rename the file. There is content in each file below this first line.
Has anyone done something like this before?
Thanks
Assuming you have all files in one directory :
#!/bin/bash
for i in *.extension
do
  # Assuming the 3rd column of the first line is the new name of the file
  # and **CHANGING_NAME** doesn't contain any spaces or meta characters
  newname=$(awk 'NR==1 && /PROCEDURE/ {print $3}' "$i")
  if [ "$newname" == "" ]; then
    echo "There is no PROCEDURE in the first line";
    echo "No new name for file $i";
  else
    mv "$i" "$newname"
  fi
done
With a lot of care and pretending that the **CHANGING_NAME** is well-formed:
for file in *.files; do mv -i -- "$file" "$(awk '{print $3; exit}' $file)" ; done
The -i option is to prevent accidentally overwriting existing files.
This version works with spaces (and many other strange characters except for /):
for file in *.files; do mv -i -- "$file" "$(sed -n '1s/^CREATE\ PROCEDURE\ \(.*\)$/\1/p' $file)"; done
Since I was never great with awk I might suggest:
#! /bin/bash
#
for i in *.extension
do
  echo "$i"
  newname=$(head -1 "${i}" | cut -d ' ' -f3)
  mv -i "${i}" "${newname}"
done
This assumes all files you're looking for have the same extension. If not, and you need the extension, you could use:
#! /bin/bash
#
for i in *
do
  echo "$i"
  ext="${i##*.}"
  newname=$(head -1 "${i}" | cut -d ' ' -f3)
  mv -i "${i}" "${newname}"."${ext}"
done
Both assume all the files are in a single directory.
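If the files are spread across subdirectories, a sketch along the same lines using find (assuming bash; *.extension is the same placeholder as above, and the new name stays in the file's own directory):
#!/usr/bin/env bash
# Rename files found recursively, using the 3rd word of their first line as the new name.
find . -type f -name '*.extension' -print0 |
  while IFS= read -r -d '' f; do
    newname=$(awk 'NR==1 { if (/PROCEDURE/) print $3; exit }' "$f")
    [ -n "$newname" ] || { echo "No PROCEDURE in the first line of $f, skipping"; continue; }
    mv -i -- "$f" "$(dirname "$f")/$newname"
  done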
You can try the next:
perl -lanE 'if($.==1&&/PROCEDURE/){close ARGV;say "$ARGV,$F[2]"}' files*
and if satisfied, change it to
perl -lanE 'if($.==1&&/PROCEDURE/){close ARGV;rename $ARGV,$F[2]}' files*
mv myfile `sed -n '1 s/.*PROCEDURE\s*//p' myfile`
(the sed command takes only the first line, deletes everything up to and including the word PROCEDURE and any spaces that follow it, and prints what remains; the backticks execute it in place, so its output is used as the target filename for the mv command)
to move them all and add an extension .ext:
for f in *.ext; do mv "$f" "$(sed -n '1 s/.*PROCEDURE\s*//p' "$f")".ext; done
