Search file of directories and find file names, save to new file - bash

I'm trying to find the paths for some fastq.gz files in a mess of a system.
I have some folder paths in a file called temp (subset):
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG167/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG265/temp/
Let's assume 2 fastq.gz files are found in each directory in temp except for /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG265/temp/.
I want to find the fastq.gz files and print them (if found) next to the directory I'm searching in.
Ideal output:
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG167/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG167/NG167_S19_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/NG178_S1_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/NG178_S1_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/NG213_S20_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/NG213_S20_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/NG230_S23_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/NG230_S23_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/NG234_S18_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/NG234_S18_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/NG250_S2_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/NG250_S2_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/NG251_S3_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/NG251_S3_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/NG257_S4_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/NG257_S4_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/NG263_S22_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/ found /temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/NG263_S22_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG265/temp/ not_found
I'm part of the way there:
wc -l temp
while read -r line; do cd $line; echo ${line} >> ~/tmp; find `pwd -P` -name "*fastq.gz" >> ~/tmp; done < temp
cd ~
less tmp
Current output:
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG167/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG167/NG167_S19_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG167/NG167_S19_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/NG178_S1_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG178/NG178_S1_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/NG213_S20_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG213/NG213_S20_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/NG230_S23_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG230/NG230_S23_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/NG234_S18_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG234/NG234_S18_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/NG250_S2_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG250/NG250_S2_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/NG251_S3_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG251/NG251_S3_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/NG257_S4_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG257/NG257_S4_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/NG263_S22_R1_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/NG263_S22_R2_001.fastq.gz
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG265/temp/
My code places the directory searched for first, then any matching files on subsequent lines. I'm not sure how to get the output I desire...
Any help, gratefully received!
Thanks,

This is not your original script: rather than running cd and find once per line (that is, once per directory), this version runs find over the whole directory tree just once, and the parsing is done inside the while read loop.
#!/usr/bin/env bash
mapfile -t to_search < temp.txt
while IFS= read -rd '' files; do
  if [[ $files == *.fastq.gz ]]; then
    printf '%s found %s\n' "${files%/*}/" "$files"
  else
    printf '%s not_found!\n' "$files" >&2
  fi
done < <(find "${to_search[@]}" -print0) | column -t
This is how I would rewrite your script, using cd in a subshell:
#!/usr/bin/env bash
while read -r line; do
  if [[ -d "$line" ]]; then
    (
      cd "$line" || exit
      varname=$(find "$(pwd -P)" -name '*fastq.gz')
      if [[ -n $varname ]]; then
        printf '%s found %s\n' "$line" "$varname"
      else
        printf '%s not_found!\n' "$line"
      fi
    )
  fi
done < temp.txt | column -t
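If the directories listed in the input file are absolute paths, as they are here, the cd can be dropped altogether. A minimal sketch of that variant (the function name report_fastqs is my own, not from the original script):

```shell
#!/usr/bin/env bash
# Read directory paths on stdin; print "dir found file" for every
# *.fastq.gz underneath each one, or "dir not_found" when there are none.
report_fastqs() {
  local dir f matches
  while IFS= read -r dir; do
    [[ -d $dir ]] || continue
    matches=$(find "$dir" -name '*.fastq.gz')
    if [[ -n $matches ]]; then
      while IFS= read -r f; do
        printf '%s found %s\n' "$dir" "$f"
      done <<< "$matches"
    else
      printf '%s not_found\n' "$dir"
    fi
  done
}
```

Run it as report_fastqs < temp | column -t.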

Given a line -
/temp/CC49/DATA/Gh7d/NYSTAG_TSO_Mar16/NG263/NG263_S22_R2_001.fastq.gz
you can get what you want for the found lines quite easily with sed - just feed the lines to it.
... | sed -e 's#^\(.*/\)\([^/]*\)$#\1 found \1\2#'
However, that doesn't eliminate the line before.
To do that you either use something like awk (and do a simple state machine), or do something like this in sed (general idea here https://stackoverflow.com/a/25203093).
... | sed -e '\#/$#{$!N;\#\n.*gz$#!P;D}'
(the alternate address delimiter has to be escaped as \#regexp#; a plain #/$# is rejected, which is why my first attempt was not working on osx).
So then you'd be left with the .gz lines already converted, and the lines ending in /, where you can also use sed to append the "not_found".
... | sed -e 's#/$#/ not_found#'
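Putting the three steps together, here is a sketch of the whole pipeline on a shortened sample of the intermediate output (GNU sed; note the backslash-escaped address delimiters, and that I anchored the found substitution on gz so it cannot touch the directory lines):

```shell
printf '%s\n' \
  '/temp/NG167/' \
  '/temp/NG167/NG167_S19_R1_001.fastq.gz' \
  '/temp/NG265/temp/' |
  # 1) drop each directory line immediately followed by a .gz line
  sed -e '\#/$#{$!N;\#\n.*gz$#!P;D}' |
  # 2) turn .gz lines into "dir/ found dir/file"; tag leftover dirs
  sed -e 's#^\(.*/\)\([^/]*gz\)$#\1 found \1\2#' -e 's#/$#/ not_found#'
```

which prints
/temp/NG167/ found /temp/NG167/NG167_S19_R1_001.fastq.gz
/temp/NG265/temp/ not_found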

Editing line with sed in for loop from other file

I'm new to bash.
I'm trying to edit a line with sed in a for loop that reads values from another file.
Please tell me what I'm doing wrong in my small code.
Am I missing another loop?
#!/bin/bash
# read and taking the line needed:
for j in `cat /tmp/check.txt`; do
# replacing the old value with and value:
sed -i "s+/tmp/old_name/+/${j}/+gi" file_destantion.txt$$
#giving numbers to the logs for checking
Num=j +1
# moving the changed file to .log number ( as for see that it is changed):
mv file_destantion.txt$$ file_destantion.txt$$.log$Num
#create ne source file to do the next value from /tmp/check:
cp -rp file_destantion.txt file_destantion.txt$$
done
The file /tmp/check holds the value I want to use on each loop iteration.
Contents of /tmp/check:
/tmp/check70
/tmp/check70_1
/tmp/_check7007
In the end this is what I want it to be like:
.log1 > will contain /tmp/check70
.log2 > will contain /tmp/check70_1
.log3 > will contain /tmp/check7007
I found that this solution worked for me.
#!/bin/bash
count=0
grep -v '^ *#' < /tmp/check | while IFS= read -r line ;do
cp -rp file_destantion.txt file_destantion.txt$$
sed -i "s+/tmp/old_name/+${line}/+gi" file_destantion.txt$$
(( count++ ))
mv file_destantion.txt$$ "file_destantion.txt$$.log${count}"
cp -rp file_destantion.txt file_destantion.txt$$
done
Thank you very much @Cyrus for your guidance.

Sequentially numbering of files in different folders while keeping the name after the number

I have a lot of ogg or wav files in different folders that I want to number sequentially while keeping everything that comes after the number prefix. The input may look like this:
Folder1/01 Insbruck.ogg
02 From Milan to Rome.ogg
03 From Rome to Naples.ogg
Folder2/01 From Naples to Palermo.ogg
02 From Palermo to Syracrus.ogg
03 From Syracrus to Tropea
The output should be:
Folder1/01 Insbruck.ogg
02 From Milan to Rome.ogg
03 From Rome to Naples.ogg
Folder2/04 From Naples to Palermo.ogg
05 From Palermo to Syracrus.ogg
06 From Syracrus to Tropea.ogg
The sequential numbering across folders can be done with this BASH script that I found here:
find . | (i=0; while read f; do
let i+=1; mv "$f" "${f%/*}/$(printf %04d "$i").${f##*.}";
done)
But this script removes the title that I would like to keep.
TL;DR
Like this, using find and perl rename:
rename -n 's#/\d+#sprintf "/%0.2d", ++$::c#e' Folder*/*
Drop the -n switch if the output looks good.
With -n, you only see the files that will really be renamed, so only 3 files from Folder2.
Going further
The variable $::c (i.e. $main::c, a package variable) is a little hack to avoid the use of more complex expressions:
rename -n 's#/\d+#sprintf "/%0.2d", ++our $c#e' Folder*/*
or
rename -n '{ no strict; s#/\d+#sprintf "/%0.2d", ++$c#e; }' Folder*/*
or
rename -n '
do {
use 5.012;
state $c = 0;
s#/\d+#sprintf "/%0.2d", ++$c#e
}
' Folder*/*
Thanks go|dfish & Grinnz on freenode
A bash script for this job would be:
#!/bin/bash
argc=$#
width=${#argc}
n=0
for src; do
base=$(basename "$src")
dir=$(dirname "$src")
if ! [[ $base =~ ^[0-9]+\ .*\.(ogg|wav)$ ]]; then
echo "$src: Unexpected file name. Skipping..." >&2
continue
fi
printf -v dest "$dir/%0${width}d ${base#* }" $((++n))
echo "moving '$src' to '$dest'"
# mv -n "$src" "$dest"
done
and could be run as
./renum Folder*/*
assuming the script is saved as renum. It will just print out source and destination file names. To do actual moving, you should drop the # at the beginning of the line # mv -n "$src" "$dest" after making sure it will work as expected. Note that the mv command will not overwrite an existing file due to the -n option. This may or may not be desirable. The script will print out a warning message and skip unexpected file names, that is, the file names not fitting the pattern specified in the question.
Not as robust as the accepted answer, but here is an improved version of your script, in case rename is not available.
#!/usr/bin/env bash
[[ -n $1 ]] || {
  printf >&2 'Needs a directory as an argument!\n'
  exit 1
}
n=1
directory=("$@")
while IFS= read -r files; do
  if [[ $files =~ ^(.+)?\/([[:digit:]]+[^[:blank:]]+)(.+)$ ]]; then
    printf -v int '%02d' "$((n++))"
    [[ -e "${BASH_REMATCH[1]}/$int${BASH_REMATCH[3]}" ]] && {
      printf '%s is already in sequential order, skipping!\n' "$files"
      continue
    }
    echo mv -v "$files" "${BASH_REMATCH[1]}/$int${BASH_REMATCH[3]}"
  fi
done < <(find "${directory[@]}" -type f | sort)
Now run the script with the directory in question as the argument.
./myscript Folder*/
or
./myscript Folder1/
or
./myscript Folder2/
or with . for the current directory:
./myscript .
and so on...
Remove the echo if you're satisfied with the output.
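To see what the three capture groups pick up, here is a standalone check on one of the sample names (the new number 4 is hard-coded just for the demo):

```shell
#!/usr/bin/env bash
# Group 1 is the directory, group 2 the old number, group 3 the title
# (including its leading space), so rebuilding with a new number gives:
files='Folder2/01 From Naples to Palermo.ogg'
if [[ $files =~ ^(.+)?\/([[:digit:]]+[^[:blank:]]+)(.+)$ ]]; then
  printf -v int '%02d' 4
  new="${BASH_REMATCH[1]}/$int${BASH_REMATCH[3]}"
  echo "$new"   # → Folder2/04 From Naples to Palermo.ogg
fi
```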

Extract a line from a text file using grep?

I have a text file called log.txt that logs a file name and the path the file originally came from, like this:
2.txt
/home/test/etc/2.txt
basically the file name and its previous location. I want to use grep to grab the file's directory, save it as a variable, and move the file back to its original location.
for var in "$@"
do
  if grep "$var" log.txt
  then
    # code if found
  else
    # code if not found
  fi
done
This just prints 2.txt and its directory to the console, since the directory line contains 2.txt.
thanks.
Maybe flip the logic to make it more efficient?
f=''
while read -r prev
do case "$prev" in
   */*) [[ -e "$f" ]] && mv "$f" "$prev";;  # a path: move the file back
   *)   f="$prev";;                         # a bare name: remember it
   esac
done < log.txt
That walks through all the files in the log and, if they exist locally, moves them back. It should be functionally the same, without a grep per file.
If the logged name is always just the basename of the path, why save it in the log at all?
If it is, then:
while read -r prev
do f="${prev##*/}" # strip the path info
   [[ -e "$f" ]] && mv "$f" "$prev"
done < <( grep / log.txt )
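As a self-contained check of that second loop, with a throwaway log built in a temp directory (all names invented for the demo):

```shell
#!/usr/bin/env bash
# Build a two-line log (name, then original path), create the stray file,
# and let the loop move it back into place.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p home/test/etc
printf '%s\n' '2.txt' "$workdir/home/test/etc/2.txt" > log.txt
touch 2.txt
while read -r prev
do f="${prev##*/}"               # strip the path info
   [[ -e "$f" ]] && mv "$f" "$prev"
done < <( grep / log.txt )
```

Afterwards 2.txt is gone from the current directory and sits in home/test/etc again.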
Having the file names on the same line would significantly simplify your script. But maybe try something like
# Convert from command-line arguments to lines
printf '%s\n' "$@" |
# Pair up with entries in file
awk 'NR==FNR { f[$0]; next }
FNR%2 { if ($0 in f) p=$0; else p=""; next }
p { print "mv \"" p "\" \"" $0 "\"" }' - log.txt |
sh
Test it by replacing sh with cat and see what you get. If it looks correct, switch back.
Briefly, something similar could perhaps be pulled off with printf '%s\n' "$@" | grep -A 1 -Fxf - log.txt, but you end up having to parse the output to pair up the lines anyway.
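For reference, a runnable sketch of the awk pairing with a made-up log; the pipeline prints the mv command (the cat-style dry run) instead of executing it:

```shell
#!/usr/bin/env bash
# Names on stdin are paired with the path line that follows them in the log.
log=$(mktemp)
printf '%s\n' '2.txt' '/home/test/etc/2.txt' '3.txt' '/home/test/etc/3.txt' > "$log"
out=$(printf '%s\n' '2.txt' |
  awk 'NR==FNR { f[$0]; next }
       FNR%2 { if ($0 in f) p=$0; else p=""; next }
       p { print "mv \"" p "\" \"" $0 "\"" }' - "$log")
echo "$out"   # → mv "2.txt" "/home/test/etc/2.txt"
```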
Another solution:
for f in $(grep -v "/" log.txt); do
  grep "/$f" log.txt | xargs -I{} cp "$f" {}
done
grep -q (for "quiet") suppresses the output.

Replacing the duplicate uuids across multiple files

I am trying to replace the duplicate UUIDs from multiple files in a directory. Even the same file can have duplicate UUIDs.
I am using Unix utilities to solve this.
So far I have used grep, cut, sort and uniq to find all the duplicate UUIDs across the folder and store them in a file (say duplicate_uuids).
Then I tried sed to replace the UUIDs by looping through the file.
filename="$1"
re="*.java"
while read line; do
uuid=$(uuidgen)
sed -i'.original' -e "s/$line/$uuid/g" *.java
done < "$filename"
As you would expect, I ended up replacing all occurrences of each duplicate UUID with a new UUID, but that new UUID is itself still duplicated throughout the file!
Is there any sed trick that can work for me?
There are a bunch of ways this can likely be done. Taking a multi-command approach using a function might give you greater flexibility if you want to customize things later, for example:
#!/bin/bash
checkdupes() {
  files="$*"
  for f in $files; do
    filename="$f"
    printf "Searching File: %s\n" "$filename"
    while read -r line; do
      arr=( $(grep -n "${line}" "$filename" | awk 'BEGIN { FS = ":" } ; {print $1" "}') )
      for i in "${arr[@]:1}"; do
        sed -i '' ''"${i}"'s/'"${line}"'/'"$(uuidgen)"'/g' "$filename"
        printf "Replaced UUID [%s] at line %s, first found on line %s\n" "${line}" "${i}" "${arr[0]}"
      done
    done < <( sort "$filename" | uniq -d )
  done
}
checkdupes /path/to/*.java
So what this series of commands does is first sort the duplicates (if any) in whatever file you choose. It takes those duplicates and uses grep and awk to create an array of the line numbers at which each duplicate is found. Looping through the array (while skipping the first value) allows each later duplicate to be replaced by a new UUID, re-saving the file each time.
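The grep -n / awk step can be tried in isolation. A small sketch with a made-up three-line file, one line duplicated:

```shell
#!/bin/bash
# Collect the line numbers at which a duplicated line occurs; everything
# after the first array element is what the sed loop would rewrite.
f=$(mktemp)
printf '%s\n' 'aaa' 'bbb' 'aaa' > "$f"
arr=( $(grep -n 'aaa' "$f" | awk 'BEGIN { FS = ":" } ; { print $1 }') )
echo "${arr[@]}"    # → 1 3
echo "${arr[@]:1}"  # → 3
```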
Using a duplicate list file:
If you want to use a file with a list of dupes to search other files and replace the UUID in each of them that match it's just a matter of changing two lines:
Replace:
for i in "${arr[@]:1}"; do
With:
for i in "${arr[@]}"; do
Replace:
done < <( sort "$filename" | uniq -d )
With:
done< <( cat /path/to/dupes_list )
NOTE: If you don't want to overwrite the file, then remove the -i '' at the beginning of the sed command.
This worked for me:
#!/bin/bash
duplicate_uuid=$1
# store file names in array
find . -name "*.java" > file_names
IFS=$'\n' read -d '' -r -a file_list < file_names
# store file duplicate uuids from file to array
IFS=$'\n' read -d '' -r -a dup_uuids < $duplicate_uuid
# loop through all files
for file in "${file_list[@]}"
do
echo "$file"
# Loop through all repeated uuids
for old_uuid in "${dup_uuids[@]}"
do
START=1
# Get the number of times uuid present in this file
END=$(grep -c $old_uuid $file)
if (( $END > 0 )) ; then
echo " Replacing $old_uuid"
fi
# Loop through them one by one and change the uuid
for (( c=$START; c<=$END; c++ ))
do
uuid=$(uuidgen)
echo " [$c of $END] with $uuid"
sed -i '.original' -e "1,/$old_uuid/s/$old_uuid/$uuid/" $file
done
done
rm $file.original
done
rm file_names

create and rename multiple copies of files

I have a file input.txt that looks as follows.
abas_1.txt
abas_2.txt
abas_3.txt
1fgh.txt
3ghl_1.txt
3ghl_2.txt
I have a folder ff. The file names in this folder are abas.txt, 1fgh.txt, 3ghl.txt. Based on the input file, I would like to create and rename multiple copies in the ff folder.
For example, in the input file abas has three copies. In the ff folder, I need to create three copies of abas.txt and rename them abas_1.txt, abas_2.txt, abas_3.txt. There is no need to copy and rename 1fgh.txt in the ff folder.
Your valuable suggestions would be appreciated.
You can try something like this (to be run from within your folder ff):
#!/bin/bash
while IFS= read -r fn; do
[[ $fn =~ ^(.+)_[[:digit:]]+\.([^\.]+)$ ]] || continue
fn_orig=${BASH_REMATCH[1]}.${BASH_REMATCH[2]}
echo cp -nv -- "$fn_orig" "$fn"
done < input.txt
Remove the echo if you're happy with it.
If you don't want to run from within the folder ff, just replace the line
echo cp -nv -- "$fn_orig" "$fn"
with
echo cp -nv -- "ff/$fn_orig" "ff/$fn"
The -n option tells cp not to overwrite existing files, and the -v option makes it verbose. The -- tells cp that there are no more options beyond this point, so that it will not be confused if one of the file names starts with a hyphen.
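A quick illustration of the -n behaviour on throwaway files (names borrowed from the example; the || true is only there because recent GNU cp versions return a nonzero status when they skip a file):

```shell
#!/bin/bash
# The copy target already exists, so cp -n leaves it untouched.
d=$(mktemp -d)
printf 'original\n' > "$d/abas.txt"
printf 'copy\n'     > "$d/abas_1.txt"
cp -nv -- "$d/abas.txt" "$d/abas_1.txt" || true
cat "$d/abas_1.txt"   # → copy
```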
Using for and grep:
#!/bin/bash
for i in $(ls)
do
x=$(echo $i | sed 's/^\(.*\)\..*/\1/')"_"
for j in $(grep $x in)
do
cp -n $i $j
done
done
Try this one
#!/bin/bash
while read newFileName; do
  # split the string by the _ delimiter
  arr=(${newFileName//_/ })
  extension="${newFileName##*.}"
  fileToCopy="${arr[0]}.$extension"
  # check for empty: the '1fgh.txt' case
  if [ -n "${arr[1]}" ]; then
    # check if the file exists
    if [ -f "$fileToCopy" ]; then
      echo "copying $fileToCopy -> $newFileName"
      cp "$fileToCopy" "$newFileName"
    #else
    #  echo "File $fileToCopy does not exist, so it can't be copied"
    fi
  fi
done
You can call your script like this:
cat input.txt | ./script.sh
If you could change the format of input.txt, I suggest you adjust it in order to make your task easier. If not, here is my solution:
#!/bin/bash
SRC_DIR=/path/to/ff
INPUT=/path/to/input.txt
BACKUP_DIR=/path/to/backup
for cand in `ls $SRC_DIR`; do
grep "^${cand%.*}_" $INPUT | while read new
do
cp -fv $SRC_DIR/$cand $BACKUP_DIR/$new
done
done