Removing special characters with sed and mv but keeping / for destination, using full path - bash

I am trying to remove special characters from specific files in files.txt. I need the mv command to use the full path to write the corrected file to the same location. The source and destination directories both contain spaces.
files.txt
/home/user/scratch/test2 2/capital:lets?.log
/home/user/scratch/test2 2/:31apples.tif
/home/user/scratch/test2 2/??testdoc1.txt
script.sh
#!/bin/bash
set -x
while IFS="" read -r p || [ -n "$p" ]
do
printf '%s\n' "$p"
mv "$p" $(echo "$p" | sed -e 's#[^A-Za-z0-9._-/]#_#g')
done < /home/user/scratch/files.txt
Here is the error that I get:
+ IFS=
+ read -r p
+ printf '%s\n' '/home/user/scratch/test2 2/??testdoc1.txt'
/home/user/scratch/test2 2/??testdoc1.txt
++ sed -e 's#[^A-Za-z0-9._-/]#_#g'
sed: -e expression #1, char 22: Invalid range end
++ echo '/home/user/scratch/test2 2/??testdoc1.txt'
+ mv '/home/user/scratch/test2 2/??testdoc1.txt'
mv: missing destination file operand after '/home/user/scratch/test2 2/??testdoc1.txt'
If I remove the / from sed -e 's#[^A-Za-z0-9._-]#_#g' command it will try to write the file like this:
++ sed -e 's#[^A-Za-z0-9._-]#_#g'
++ echo '/home/user/scratch/test2 2/??testdoc1.txt'
+ mv '/home/user/scratch/test2 2/??testdoc1.txt' _home_user_scratch_test2_2___testdoc1.txt
I have tried changing the delimiter in sed to something other than a / but the issue persists. If I try using mv "$p" "$(echo "$p" | sed -e 's|/[^/]*/\{0,1\}$||;s|^$|/|')" mv errors with this is the same file.
Am I approaching this problem wrong? This feels like it should have been an easier task.
EDIT:
The solution below gives me an issue with the file itself:
' echo '/mnt/data/bucket/Desktop/For_the_New_Director/Part Number Assignment/__Prod_Development/.Memeo 40'\'' flat w:boat plane.xls.plist
/mnt/data/bucket//Desktop/For_the_New_Director/Part Number Assignment/__Prod_Development/.Memeo 40' flat w:boat plane.xls.plist
+ dir='/mnt/data/bucket/Desktop/For_the_New_Director/Part Number Assignment/__Prod_Development'
= */* ]]/data/bucket/Desktop/For_the_New_Director/Part Number Assignment/__Prod_Development/.Memeo 40' flat w:boat plane.xls.plist
' file='.Memeo 40'\'' flat w:boat plane.xls.plist
+ echo .Memeo '40'\''' flat w:boat $'plane.xls.plist\r'
.Memeo 40' flat w:boat plane.xls.plist
+ echo /mnt/data/bucket/Desktop/For_the_New_Director/Part Number Assignment/__Prod_Development
/mnt/data/bucket/Desktop/For_the_New_Director/Part Number Assignment/__Prod_Development
The actual filename is: .Memeo 40' flat w:boat plane.xls.plist
Why is it changing the filename when trying to do the move?

There are two problems in your substitution.
First, in the bracket expression [^A-Za-z0-9._-/], the last part _-/ is interpreted as a range of characters from _ to /, which is invalid because _ sorts after / (hence the "Invalid range end" error). To avoid this, put the hyphen at the beginning or the end of the bracket expression so that it is taken literally. A backslash does not help here, because it is not an escape character inside a POSIX bracket expression.
Second, the directory name test2 2 contains a space, which the bracket expression also treats as a special character, so the sed command converts the directory name into test2_2, which does not exist. Assuming you want to change only the filenames and keep the directory names as they are, the directory name and the filename need to be processed separately.
Then would you please try the following:
set -x
while IFS= read -r p || [ -n "$p" ]; do
echo "$p"
dir=${p%/*} # extract directory name
[[ $p = */* ]] || dir="." # in case $p does not contain "/"
file=${p##*/} # extract filename
mv -- "$p" "$dir/${file//[^-A-Za-z0-9._]/_}"
done < /home/user/scratch/files.txt
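The trace in the EDIT shows the last word as $'plane.xls.plist\r', which suggests (this is an inference from the trace, not something confirmed by the asker) that files.txt has Windows-style CRLF line endings, so every path read from it ends in a carriage return. A minimal sketch of the same directory/filename split that also drops a trailing carriage return; the helper name sanitized_path is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the sanitized destination for one path.
# Only the filename part is rewritten; the directory part is kept as is.
sanitized_path() {
  local p=${1%$'\r'}              # drop one trailing carriage return, if any
  local dir=${p%/*}               # directory part
  local file=${p##*/}             # filename part
  [[ $p == */* ]] || dir=.        # path without "/" lives in the current dir
  # Leading "-" inside the bracket expression is a literal hyphen.
  printf '%s\n' "$dir/${file//[^-A-Za-z0-9._]/_}"
}
```

Inside the loop this could be used as mv -- "${p%$'\r'}" "$(sanitized_path "$p")", assuming the file on disk does not itself end in a carriage return.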

Related

Rename files matching pattern in a loop - Bash

I have been trying to rename some specific files based on a table but with no success. It either renames all files or gives an error.
The directory contains hundreds of files named with long barcodes and I want to rename only files containing the pattern _1_.
Example
barcode_1_barcode_SL484171.fastq.gz barcode_2_barcode_SL484171.fastq.gz barcode_1_barcode_SL484370.fastq.gz barcode_2_barcode_SL484370.fastq.gz
mytable.txt
oldname newname
barcode_1_barcode_SL484171 Description1
barcode_2_barcode_SL484171 Description1
barcode_1_barcode_SL484370 Description2
barcode_2_barcode_SL484370 Description2
Desired output:
Description1.R1.fastq.gz Description2.R1.fastq.gz
As you can see in the table there are two files per description but I only want to rename the ones with the _1_ pattern.
Code I have tried:
for i in *_1_*.fastq.gz; do read oldname newname; mv "$oldname" "$newname".R1.fastq.gz; done < mytable.txt
for i in $(grep '_1_' mytable.txt); do read -r oldname newname; mv ${oldname} ${newname}.R1.fastq.gz; done < mytable.txt
for i in $(grep '_1_' mytable.txt); do oldname=$(cut -f1 $i);newname=$(cut -f2 $i); ln -s ${oldname} ${newname}.R1.fastq.gz; done
while read -r oldname newname
do
if [[ $oldname =~ "_1_" ]]
then
mv $oldname $newname
fi
done < mytable.txt
Something like this.
#!/usr/bin/env bash
while IFS= read -r files; do ##: loop through the output of `find . -name 'barcode_1_barcode*.gz' | grep -f <(cut -d' ' -f1 table.txt)`
  while read -ru9 old_name prefix; do ##: loop through the output of `grep 'barcode_1_barcode.*' table.txt`
    if [[ $files == *"$old_name"* ]]; then ##: if the filename from find contains the first field of table.txt (space delimited)
      old_filename="${files%.fastq.gz}" ##: the filename without the .fastq.gz extension
      extension="${files#"$old_filename"}" ##: the .fastq.gz extension without the filename
      # mv -v "$files" "$prefix.R1${extension}"
      printf '%s %s %s ==> %s\n' mv -v "$files" "$prefix.R1${extension}" ##: dry run: print the rename that mv would perform
    fi
  done 9< <(grep 'barcode_1_barcode.*' table.txt)
done < <(find . -name 'barcode_1_barcode*.gz' | grep -f <(cut -d' ' -f1 table.txt) ) ##: keep only the files whose names contain a first-column entry of table.txt
Output from the OP's sample data/files.
renamed './barcode_1_barcode_SL484370.fastq.gz' -> 'Description2.R1.fastq.gz'
renamed './barcode_1_barcode_SL484171.fastq.gz' -> 'Description1.R1.fastq.gz'
If you're satisfied with the output either move the # from the front of mv to the
front of printf or just delete the entire line with printf and remove the # from
mv in order for mv to actually rename the files.
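As a simpler alternative (not the answerer's method), the asker's last while read attempt can be made to work by appending the .fastq.gz suffix that the table omits. This is a minimal sketch assuming the table is whitespace-separated with two columns and that the files sit in the current directory; the function name rename_r1 is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: rename only the "_1_" files listed in a two-column table.
rename_r1() {
  local table=$1 oldname newname
  # "|| [[ -n $oldname ]]" also handles a last line without a trailing newline.
  while read -r oldname newname || [[ -n $oldname ]]; do
    [[ $oldname == *_1_* ]] || continue           # skips the header and "_2_" rows
    [[ -f $oldname.fastq.gz ]] || continue        # skip rows with no matching file
    mv -- "$oldname.fastq.gz" "$newname.R1.fastq.gz"
  done < "$table"
}
```

It would be called as rename_r1 mytable.txt from the directory that holds the fastq.gz files.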

Search file of directories and find file names, save to new file - bash

I'm trying to find the paths for some fastq.gz files in a mess of a system.
I have some folder paths in a file called temp (subset):
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG167/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG265/temp/
Let's assume 2 fastq.gz files are found in each directory in temp except for /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG265/temp/.
I want to find the fastq.gz files and print them (if found) next to the directory I'm searching in.
Ideal output:
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG167/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG167/NG167_S19_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/NG178_S1_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/NG178_S1_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/NG213_S20_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/NG213_S20_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/NG230_S23_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/NG230_S23_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/NG234_S18_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/NG234_S18_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/NG250_S2_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/NG250_S2_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/NG251_S3_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/NG251_S3_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/NG257_S4_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/NG257_S4_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/NG263_S22_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/NG263_S22_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG265/temp/ not_found
I'm part of the way there:
wc -l temp
while read -r line; do cd $line; echo ${line} >> ~/tmp; find `pwd -P` -name "*fastq.gz" >> ~/tmp; done < temp
cd ~
less tmp
Current output:
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG167/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG167/NG167_S19_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG167/NG167_S19_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/NG178_S1_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/NG178_S1_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/NG213_S20_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/NG213_S20_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/NG230_S23_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/NG230_S23_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/NG234_S18_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/NG234_S18_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/NG250_S2_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/NG250_S2_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/NG251_S3_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/NG251_S3_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/NG257_S4_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/NG257_S4_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/NG263_S22_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/NG263_S22_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG265/temp/
My code places the directory searched for first, then any matching files on subsequent lines. I'm not sure how to get the output I desire...
Any help, gratefully received!
Thanks,
This is not your original script: instead of running cd and find once per line (that is, once per directory), this version runs find a single time over all the listed directories and does the parsing inside the while read loop.
#!/usr/bin/env bash
mapfile -t to_search < temp.txt
while IFS= read -rd '' files; do
if [[ $files == *.fastq.gz ]]; then
printf '%s found %s\n' "${files%/*}/" "$files"
else
printf '%s not_found!\n' "$files" >&2
fi
done < <(find "${to_search[@]%/*.fastq.gz*}" -print0) | column -t
This is how I would rewrite your script, using cd in a subshell:
#!/usr/bin/env bash
while IFS= read -r line; do
  if [[ -d "$line" ]]; then
    (
      cd "$line" || exit
      found=0
      # find prints one absolute path per line; report each match separately
      while IFS= read -r file; do
        printf '%s found %s\n' "$line" "$file"
        found=1
      done < <(find "$(pwd -P)" -name '*fastq.gz')
      (( found )) || printf '%s not_found!\n' "$line"
    )
  fi
done < temp.txt | column -t
Given a line -
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/NG263_S22_R2_001.fastq.gz
you can get what you want for the found lines quite easily with sed - just feed the lines to it.
... | sed -e 's#^\(.*/\)\([^/]*\)$#\1 found \1\2#'
However, that doesn't eliminate the line before.
To do that you either use something like awk (and do a simple state machine), or do something like this in sed (general idea here https://stackoverflow.com/a/25203093).
... | sed -e '#/$#{$!N;#\n.*gz$#!P;D}'
(although I think I have a typo as it is not working for me on osx).
So then you'd be left with the .gz lines already converted, and the lines ending in / where you can also use sed to then append the "not found".
... | sed -e 's#/$#/ not found#'
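The state machine mentioned above can also be written as a plain bash loop instead of awk or sed. This is a sketch under the assumption that the input has the shape of the "Current output" listing: a line ending in / names a directory, and the .gz lines that follow belong to it. The function name pair_dirs_with_files is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "directory line, then file lines" on stdin and
# print "<dir> found <file>" per file, or "<dir> not_found" for empty ones.
pair_dirs_with_files() {
  local line dir='' seen=0
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line == */ ]]; then
      # New directory: report the previous one if it had no files.
      if [[ -n $dir && $seen -eq 0 ]]; then
        printf '%s not_found\n' "$dir"
      fi
      dir=$line
      seen=0
    elif [[ $line == *.gz ]]; then
      printf '%s found %s\n' "$dir" "$line"
      seen=1
    fi
  done
  # Report the final directory if nothing was listed under it.
  if [[ -n $dir && $seen -eq 0 ]]; then
    printf '%s not_found\n' "$dir"
  fi
}
```

It could be fed the existing listing, for example pair_dirs_with_files < ~/tmp.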

Expand shell glob in variable into array

In a bash script I have a variable containing a shell glob expression that I want to expand into an array of matching file names (nullglob turned on), like in
pat='dir/*.config'
files=($pat)
This works nicely, even for multiple patterns in $pat (e.g., pat="dir/*.config dir/*.conf"); however, I cannot use escape characters in the pattern. Ideally, I would like to be able to do
pat='"dir/*" dir/*.config "dir/file with spaces"'
to include the file *, all files ending in .config and file with spaces.
Is there an easy way to do this? (Without eval if possible.)
As the pattern is read from a file, I cannot place it in the array expression directly, as proposed in this answer (and various other places).
Edit:
To put things into context: What I am trying to do is to read a template file line-wise and process all lines like #include pattern. The includes are then resolved using the shell glob. As this tool is meant to be universal, I want to be able to include files with spaces and weird characters (like *).
The "main" loop reads like this:
template_include_pat='^#include (.*)$'
while IFS='' read -r line || [[ -n "$line" ]]; do
if printf '%s' "$line" | grep -qE "$template_include_pat"; then
glob=$(printf '%s' "$line" | sed -nrE "s/$template_include_pat/\\1/p")
cwd=$(pwd -P)
cd "$targetdir"
files=($glob)
for f in "${files[@]}"; do
printf "\n\n%s\n" "# FILE $f" >> "$tempfile"
cat "$f" >> "$tempfile" ||
die "Cannot read '$f'."
done
cd "$cwd"
else
echo "$line" >> "$tempfile"
fi
done < "$template"
Using the Python glob module:
#!/usr/bin/env bash
# Takes literal glob expressions as argv; emits NUL-delimited match list on output
expand_globs() {
  python -c '
import sys, glob
for arg in sys.argv[1:]:
    for result in glob.iglob(arg):
        sys.stdout.write("%s\0" % (result,))
' _ "$@"
}
template_include_pat='^#include (.*)$'
template=${1:-/dev/stdin}
# record the patterns we were looking for
patterns=( )
while read -r line; do
if [[ $line =~ $template_include_pat ]]; then
patterns+=( "${BASH_REMATCH[1]}" )
fi
done <"$template"
results=( )
while IFS= read -r -d '' name; do
results+=( "$name" )
done < <(expand_globs "${patterns[@]}")
# Let's display our results:
{
printf 'Searched for the following patterns, from template %q:\n' "$template"
(( ${#patterns[@]} )) && printf ' - %q\n' "${patterns[@]}"
echo
echo "Found the following files:"
(( ${#results[@]} )) && printf ' - %q\n' "${results[@]}"
} >&2
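If each #include line carries exactly one pattern, a pure-bash alternative (not the approach of the answer above) is to empty IFS before the unquoted expansion: the pattern is then still glob-expanded but no longer split on spaces. A minimal sketch; the function name expand_one_glob and the global array name matches are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: expand ONE glob pattern, which may contain spaces,
# into the global array "matches".
expand_one_glob() {
  local pattern=$1
  local IFS=                      # empty IFS: no word splitting on expansion
  local had_nullglob=0
  shopt -q nullglob && had_nullglob=1
  shopt -s nullglob               # a pattern with no match expands to nothing
  matches=( $pattern )            # unquoted on purpose: glob, but do not split
  (( had_nullglob )) || shopt -u nullglob   # restore the caller's setting
}
```

With this sketch a file literally named * can be selected with a bracket expression, for example dir/[*]. Several patterns on one line would still need their own parsing step.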

Looping list of folder path containing comma "," and spaces results in error

The following code works well, but it fails when a folder path contains "," or spaces:
dir data/ > folder_file.txt
IFS=$'\n'
for file in "`cat folder_file.txt`"
do
printf 'File found: %s\n' "$file"
ls "data/$file/" # "," and spaces break this step
done
Any idea how to escape the special characters?
It works now. Any other advice to make it better?
IFS=$'\n'
a=0
for file in out/*; do
ls "$file" > html_file.txt
for file2 in `cat html_file.txt`; do
echo $file
mv "$file""/""$file2" "$file""/""page_"$a
let a=$a+1
done
a=0
done
This is how you loop on the content of a directory:
#!/bin/bash
shopt -s nullglob
for file in data/*; do
printf 'File found: %s\n' "$file"
ls "$file"
done
We use the shell option nullglob so that the glob * expands to nothing (and hence the loop body never runs) in case there are no matches.
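The same glob-based looping can be applied to the asker's second script, which renames the files in each subdirectory to page_0, page_1, and so on. This is a sketch under the assumption that every entry of interest is a regular file directly inside a subdirectory; the function name rename_pages is hypothetical, and an existing page_N file in the same directory could be overwritten:

```bash
#!/usr/bin/env bash
# Hypothetical helper: number the files of each subdirectory of "$1" as
# page_0, page_1, ... without parsing ls output or changing IFS.
rename_pages() {
  local top=$1 dir file n
  for dir in "$top"/*/; do
    [[ -d $dir ]] || continue          # the glob matched nothing
    n=0                                # the counter restarts per directory
    for file in "$dir"*; do
      [[ -f $file ]] || continue       # skip subdirectories and non-matches
      mv -- "$file" "${dir}page_$n"
      n=$((n + 1))
    done
  done
}
```

For the layout in the question it would be called as rename_pages out.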

remove file starting with space in shell scripting

I'm trying to write a shell script to clean up a directory by deleting files that match particular patterns. My code works with all patterns except when the file name starts with a space. We can delete a file starting with a space with rm \ *, but if I pass this pattern to my script it won't delete files starting with a space. Here is my code:
for file in *;do
for pattern in $*; do
if [[ -f "$file" && "$file" == $pattern ]]; then
rm "$file"
fi
done
done
I also tried this simpler code, but the same problem!
for pattern in $*; do
if [[ -f $pattern ]]; then
rm $pattern
fi
done
Could you please help me why there is a problem just with files starting with space?!
Rather than $*, use the special parameter "$@" in double quotes: each argument then expands to its own word, with any leading or embedded spaces preserved. You still have to quote the variables where you use them.
Reworking the second example, that would be
for pattern in "$@"; do
if [[ -f "$pattern" ]]; then
rm -f "$pattern"
fi
done
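The difference between the two expansions can be seen by counting the words each loop iterates over. A small sketch with hypothetical function names:

```bash
#!/usr/bin/env bash
# Count loop iterations over the unquoted $* (arguments are word-split).
count_unquoted() {
  local n=0 arg
  for arg in $*; do n=$((n + 1)); done
  printf '%s\n' "$n"
}

# Count loop iterations over the quoted "$@" (one word per argument).
count_quoted() {
  local n=0 arg
  for arg in "$@"; do n=$((n + 1)); done
  printf '%s\n' "$n"
}
```

With the default IFS, calling both with the two arguments ' my file' and 'other' gives 3 for count_unquoted and 2 for count_quoted, which is why a name with a leading space never reaches the rm intact in the original loop.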
This is a challenging one.
For starters, please see the example below:
[shravan@localhost mydir]$ ls " myfile"
myfile
[shravan@localhost mydir]$ echo $vr1
" myfile"
[shravan@localhost mydir]$ ls $vr1
ls: ": No such file or directory
ls: myfile": No such file or directory
[shravan@localhost mydir]$ vr2=" myfile"
[shravan@localhost mydir]$ echo $vr2
myfile
You can see above that ls " myfile" works, but it does not work after assigning this value to the variable vr1 or vr2 and expanding the variable unquoted.
So we cannot check this way whether the file exists.
As a solution, keep all your patterns in a file, each wrapped in double quotes. See the example below.
[shravan@localhost mydir]$ touch " myfile"
[shravan@localhost mydir]$ touch my.pl
[shravan@localhost mydir]$ ls
exe.sh findrm inp input myfile my.pl pattern text text1
[shravan@localhost mydir]$ cat inp
" myfile"
"my.pl"
[shravan@localhost mydir]$ cat inp | xargs rm
[shravan@localhost mydir]$ ls
exe.sh findrm inp input pattern text text1
The files are removed. Or, if you have a lot of patterns and don't want to add quotes to them, use the following:
cat inp | awk '{print "\""$0"\""}' | xargs rm
If a file is not found, rm reports an error for that file:
rm: cannot remove ` myfile': No such file or directory
for file in *;do
for pattern in "$@"; do
if [[ -f "$file" && "$file" == $pattern ]]; then
rm "$file"
fi
done
done
If we simply change $* to the quoted "$@", each individual argument is kept as a single word and no space is lost. Normally the string to the right of == inside [[ ]] would be quoted to force a literal comparison, because an unquoted right-hand side is treated as a pattern. Here $pattern is deliberately left unquoted so that each argument is matched as a glob pattern.
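The loop above can be wrapped in a function for a quick check with a file whose name starts with a space. This is a sketch; the function name remove_matching is hypothetical, and it only looks at regular files in the current directory:

```bash
#!/usr/bin/env bash
# Hypothetical helper: delete regular files in the current directory whose
# names match any of the glob patterns passed as arguments.
remove_matching() {
  local file pattern
  for file in *; do
    [[ -f $file ]] || continue
    for pattern in "$@"; do
      # $pattern is unquoted on purpose so that it is matched as a glob.
      if [[ $file == $pattern ]]; then
        rm -- "$file"
        break                      # the file is gone; try the next one
      fi
    done
  done
}
```

A call such as remove_matching ' *' '*.bak' passes the first pattern with its leading space intact, as long as the caller quotes it.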
