For Loop with Filename Pairing - for-loop

I am trying to create a script that will, within a for loop, identify pairs of files in a directory and then perform a function on each pair. The paired files are named such as FILENAME_1.fastq and FILENAME_2.fastq and there are multiple pairs within the directory. Here are some actual filenames in case it matters for regex functions:
WT1_0min-SRR9929263_1.fastq
WT1_0min-SRR9929263_2.fastq
WT1_20min-SRR9929265_1.fastq
WT1_20min-SRR9929265_2.fastq
WT3_20min-SRR12062597_1.fastq
WT3_20min-SRR12062597_2.fastq
Is it possible to do this without feeding any information about the filename, other than that it has a pair? I am absolutely awful with regex and name-search functions, but below is my latest failed attempt.
cd ~/Directory
for file in *.fastq
do
sample=`basename ${file}` #I think needs a modification to subtract the _1 or _2 and then a search function to find the paired files
myfunction \
-1 ${sample}_1.fastq \
-2 ${sample}_2.fastq \
done
Thanks for any help. Been stuck for 2 days x_x
UPDATE
Please see this new post for answers on how to adapt the xarg answer for use with a for loop.

to account for the scenario not every file is fully paired , try
file . -depth 1 type f -not -name ".*" | \
\
gawk 'BEGIN { FS="_"; } { $ 0 = gensub(/^.+\/([^\/]+)$/ , "\\1", "1"); }
{ inL[$1$2][substr($3,1,1)] = $0 ; }
END { OFS = ORS = "\0";
for (pfx in inL) {
if (1 in inL[pfx]) && (2 in inL[pfx]) && \
(length(inL[pfx])==2)
{ print inL[pfx][1], inL[pfx][2]; } } }' | \
\
parallel -0 -N 2 -j 1 myfunction -1 '{}' -2 '{}' ;
gnu parallel allows for functions to be exported. this version of the code will use gawk to handle both basename and print0 functions. This also ensures ONLY files with exact 1+2 parings will be shown in the end, in case there are ones with only one of the 2, or some files with even a "_3.fastq", shall your need to expand into such a realm.

Use find and xargs, and replace echo with the command of your choice:
find . -name '*_1.fastq' -exec basename {} '_1.fastq' \; | xargs -n1 -I{} echo {}_1.fastq {}_2.fastq

In bioinformatics we use this all the time when having a paired end run:
parallel --plus echo {} {/_1.fastq/_2.fastq} ::: *_1.fastq
or:
parallel echo {} {=s/_1.fastq/_2.fastq/=} ::: *_1.fastq

Related

Bash find: exec in reverse oder

I am iterating over files like so:
find $directory -type f -exec codesign {} \;
Now the problem here is that files on a higher hierarchy are signed first.
Is there a way to iterate over a directory tree and handle the deepest files first?
So that
/My/path/to/app/bin
is handled before
/My/path/mainbin
Yes, just use -depth:
-depth
The primary shall always evaluate as true; it shall cause descent of the directory hierarchy to be done so that all entries in a directory are acted on before the directory itself. If a -depth primary is not specified, all entries in a directory shall be acted on after the directory itself. If any -depth primary is specified, it shall apply to the entire expression even if the -depth primary would not normally be evaluated.
For example:
$ mkdir -p top/a/b/c/d/e/f/g/h
$ find top -print
top
top/a
top/a/b
top/a/b/c
top/a/b/c/d
top/a/b/c/d/e
top/a/b/c/d/e/f
top/a/b/c/d/e/f/g
top/a/b/c/d/e/f/g/h
$ find top -depth -print
top/a/b/c/d/e/f/g/h
top/a/b/c/d/e/f/g
top/a/b/c/d/e/f
top/a/b/c/d/e
top/a/b/c/d
top/a/b/c
top/a/b
top/a
top
Note that at a particular level, ordering is still arbitrary.
Using GNU utilities, and decorate-sort-undecorate pattern (aka Schwartzian transform):
find . -type f -printf '%d %p\0' |
sort -znr |
sed -z 's/[0-9]* //' |
xargs -0 -I# echo codesign #
Drop the echo if the output looks ok.
Using find's -depth option as my other answer, or naive sort as some others, only ensures that sub-directories of a directory are processed before the directory itself, but not that the deepest level is processed first.
For example:
$ mkdir -p top/a/b/d/f/h top/a/c/e/g
$ find top -depth -print
top/a/c/e/g
top/a/c/e
top/a/c
top/a/b/d/f/h
top/a/b/d/f
top/a/b/d
top/a/b
top/a
top
For overall deepest level to be processed first, the ordering should be something like:
top/a/b/d/f/h
top/a/c/e/g
top/a/b/d/f
top/a/c/e
top/a/b/d
top/a/c
top/a/b
top/a
top
To determine this ordering, the entire list must be known, and then the number of levels (ie. /) of each path counted to enable ranking.
A simple-ish Perl script (assigned to a shell function for this example) to do this ordering is:
$ dsort(){
perl -ne '
BEGIN { $/ = "\0" } # null-delimited i/o
$fname[$.] = $_;
$depth[$.] = tr|/||;
END {
print
map { $fname[$_] }
sort { $depth[$b] <=> $depth[$a] }
keys #fname
}
'
}
Then:
$ find top -print0 | dsort | xargs -0 -I# echo #
top/a/b/d/f/h
top/a/c/e/g
top/a/b/d/f
top/a/c/e
top/a/b/d
top/a/c
top/a/b
top/a
top
How about sorting the output of find in descending order:
while IFS= read -d "" -r f; do
codesign "$f"
done < <(find "$directory" -type f -print0 | sort -zr)
<(command ..) is a process substitution which feeds the output
of the command to the read command in while loop via the redirect.
-print0, sort -z and read -d "" combo uses a null character
as a file delimiter. It is useful to protect filenames which include
special characters such as whitespace.
I don't know if there is a native way in find, but you may pipe the output of it into a loop and process it line by line as you wish this way:
find . | while read file; do echo filename: "$file"; done
In your case, if you are happy just reversing the output of find, you may go with something like:
find $directory -type f | tac | while read file; do codesign "$file"; done

Want to add the headers in text file in shell script

I want to add the header at the start of the file for that I use the following code but it will not add the header can please help where i am doing wrong.
start_MERGE_JobRec()
{
FindBatchNumber
export TEMP_SP_FORMAT="Temp_${file_indicator}_SP_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_INSTANCE[0-9][0-9].txt"
export TEMP_SS_FORMAT="Temp_${file_indicator}_SS_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_INSTANCE[0-9][0-9].txt"
export TEMP_SG_FORMAT="Temp_${file_indicator}_SG_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_INSTANCE[0-9][0-9].txt"
export TEMP_GS_FORMAT="Temp_${file_indicator}_GS_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_INSTANCE[0-9][0-9].txt"
export SP_OUTPUT_FILE="RTBCON_${file_indicator}_SP_${ONLINE_DATE}${TIME}_${BATCH_NUMBER}.txt"
export SS_OUTPUT_FILE="RTBCON_${file_indicator}_SS_${ONLINE_DATE}${TIME}_${BATCH_NUMBER}.txt"
export SG_OUTPUT_FILE="RTBCON_${file_indicator}_SG_${ONLINE_DATE}${TIME}_${BATCH_NUMBER}.txt"
export GS_OUTPUT_FILE="RTBCON_${file_indicator}_GS_${ONLINE_DATE}${TIME}_${BATCH_NUMBER}.txt"
#---------------------------------------------------
# Add header at the start for each file
#---------------------------------------------------
awk '{print "recordType|lifetimeId|MSISDN|status|effectiveDate|expiryDate|oldMSISDN|accountType|billingAccountNumber|usageTypeBen|IMEI|IMSI|cycleCode|cycleMonth|firstBillExperience|recordStatus|failureReason"$0}' >> $SP_OUTPUT_FILE
find . -maxdepth 1 -type f -name "${TEMP_SP_FORMAT}" -exec cat {} + >> $SP_OUTPUT_FILE
find . -maxdepth 1 -type f -name "${TEMP_SS_FORMAT}" -exec cat {} + >> $SS_OUTPUT_FILE
find . -maxdepth 1 -type f -name "${TEMP_SG_FORMAT}" -exec cat {} + >> $SG_OUTPUT_FILE
find . -maxdepth 1 -type f -name "${TEMP_GS_FORMAT}" -exec cat {} + >> $GS_OUTPUT_FILE
}
I use awk to add the header but it's not working.
Awk requires an input file before it will print anything.
A common way to force Awk to print something even when there is no input is to put the print statement in a BEGIN block;
awk 'BEGIN { print "something" }' /dev/null
but if you want to prepend a header to all the output files, I don't see why you are using Awk here at all, let alone printing the header in front of every output line. Are you looking for this, instead?
echo 'recordType|lifetimeId|MSISDN|status|effectiveDate|expiryDate|oldMSISDN|accountType|billingAccountNumber|usageTypeBen|IMEI|IMSI|cycleCode|cycleMonth|firstBillExperience|recordStatus|failureReason' |
tee "$SS_OUTPUT_FILE" "$SG_OUTPUT_FILE" "$GS_OUTPUT_FILE" >"$SP_OUTPUT_FILE"
Notice also how we generally always quote shell variables unless we specifically want the shell to perform word splitting and wildcard expansion on their values, and avoid upper case for private variables.
There also does not seem to be any reason to export your variables -- neither Awk nor find pays any attention to them, and there are no other processes here. The purpose of export is to make a variable visible to the environment of subprocesses. You might want to declare them as local, though.
Perhaps break out a second function to avoid all this code repetition, anyway?
merge_individual_job() {
echo 'recordType|lifetimeId|MSISDN|status|effectiveDate|expiryDate|oldMSISDN|accountType|billingAccountNumber|usageTypeBen|IMEI|IMSI|cycleCode|cycleMonth|firstBillExperience|recordStatus|failureReason'
find . -maxdepth 1 -type f -name "$1" -exec cat {} +
}
start_MERGE_JobRec()
{
FindBatchNumber
local id
for id in SP SS SG GS; do
merge_individual_job \
"Temp_${file_indicator}_${id}_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_INSTANCE[0-9][0-9].txt" \
>"RTBCON_${file_indicator}_${id}_${ONLINE_DATE}${TIME}_${BATCH_NUMBER}.txt"
done
}
If FindBatchNumber sets the variable file_indicator, a more idiomatic and less error-prone approach is to have it just echo it, and have the caller assign it:
file_indicator=$(FindBatchNumber)

Script for printing out file names and their number of appearance starting from a given folder

I need to write a shell script, which starts with a given folder name as an argument will print out the names of folder and files in it and how many times does each name appear in the given folder.
edit I need to check only their names, without taking into consideration the file extensions.
#!/bin/bash
folder="$1"
for f in "$folder"
do
echo "$f"
done
And I would expect to see something like this (if i have 3 files with the same name and different extension like x.html, x.css, x.sh and so on, in a directory called dir)
x
3 times
after executing the script with dir (the name of the directory) as a parameter.
The find command already does most of this for you.
find . -printf "%f\n" |
sort | uniq -c
This will not work correctly if you have files whose names contain a newline.
If your find doesn't support -printf, maybe try
find . -exec basename {} \; |
sort | uniq -c
To restrict to just file names or directory names, add -type f or -type d, respectively, before the action (-exec or -printf).
If you genuinely want to remove extensions, try
find .... whatever ... |
sed 's%\.[^./]*$%%' |
sort | uniq -c
Can you try this,
#!/bin/bash
IFS=$'\n' array=($(ls))
iter=0;
for file in ${array[*]}; do
filename=$(basename -- "$file")
extension="${filename##*.}"
filename="${filename%.*}"
filenamearray[$iter]=$filename
iter=$((iter+1))
done
for filename in ${filenamearray[#]}; do
echo $filename;
grep -o $filename <<< "${filenamearray[#]}" | wc -l
done
You can try with find and awk :
find . -type f -print0 |
awk '
BEGIN {
FS="/"
RS="\0"
}
{
k = split( $NF , b , "." )
if ( k > 1 )
sub ( "."b[k] , "" , $NF )
a[$NF]++
}
END {
for ( i in a ) {
j = a[i]>1 ? "s" : ""
print i
print a[i] " time" j
}
}'

Using bash: how do you find a way to search through all your files in a directory (recursively?)

I need a command that will help me accomplish what I am trying to do. At the moment, I am looking for all the ".html" files in a given directory, and seeing which ones contain the string "jacketprice" in any of them.
Is there a way to do this? And also, for the second (but separate) command, I will need a way to replace every instance of "jacketprice" with "coatprice", all in one command or script. If this is feasible feel free to let me know. Thanks
find . -name "*.html" -exec grep -l jacketprice {} \;
for i in `find . -name "*.html"`
do
sed -i "s/jacketprice/coatprice/g" $i
done
As for the second question,
find . -name "*.html" -exec sed -i "s/jacketprice/coatprice/g" {} \;
Use recursive grep to search through your files:
grep -r --include="*.html" jacketprice /my/dir
Alternatively turn on bash's globstar feature (if you haven't already), which allows you to use **/ to match directories and sub-directories.
$ shopt -s globstar
$ cd /my/dir
$ grep jacketprice **/*.html
$ sed -i 's/jacketprice/coatprice/g' **/*.html
Depending on whether you want this recursively or not, perl is a good option:
Find, non-recursive:
perl -nwe 'print "Found $_ in file $ARGV\n" if /jacketprice/' *.html
Will print the line where the match is found, followed by the file name. Can get a bit verbose.
Replace, non-recursive:
perl -pi.bak -we 's/jacketprice/coatprice/g' *.html
Will store original with .bak extension tacked on.
Find, recursive:
perl -MFile::Find -nwE '
BEGIN { find(sub { /\.html$/i && push #ARGV, $File::Find::name }, '/dir'); };
say $ARGV if /jacketprice/'
It will print the file name for each match. Somewhat less verbose might be:
perl -MFile::Find -nwE '
BEGIN { find(sub { /\.html$/i && push #ARGV, $File::Find::name }, '/dir'); };
$found{$ARGV}++ if /jacketprice/; END { say for keys %found }'
Replace, recursive:
perl -MFile::Find -pi.bak -we '
BEGIN { find(sub { /\.html$/i && push #ARGV, $File::Find::name }, '/dir'); };
s/jacketprice/coatprice/g'
Note: In all recursive versions, /dir is the bottom level directory you wish to search. Also, if your perl version is less than 5.10, say can be replaced with print followed by newline, e.g. print "$_\n" for keys %found.

How can I copy all my disorganized files into a single directory? (on linux)

I have thousands of mp3s inside a complex folder structure which resides within a single folder. I would like to move all the mp3s into a single directory with no subfolders. I can think of a variety of ways of doing this using the find command but one problem will be duplicate file names. I don't want to replace files since I often have multiple versions of a same song. Auto-rename would be best. I don't really care how the files are renamed.
Does anyone know a simple and safe way of doing this?
You could change a a/b/c.mp3 path into a - b - c.mp3 after copying. Here's a solution in Bash:
find srcdir -name '*.mp3' -printf '%P\n' |
while read i; do
j="${i//\// - }"
cp -v "srcdir/$i" "dstdir/$j"
done
And in a shell without ${//} substitution:
find srcdir -name '*.mp3' -printf '%P\n' |
sed -e 'p;s:/: - :g' |
while read i; do
read j
cp -v "srcdir/$i" "dstdir/$j"
done
For a different scheme, GNU's cp and mv can make numbered backups instead of overwriting -- see -b/--backup[=CONTROL] in the man pages.
find srcdir -name '*.mp3' -exec cp -v --backup=numbered {} dstdir/ \;
bash like pseudocode:
for i in `find . -name "*.mp3"`; do
NEW_NAME = `basename $i`
X=0
while ! -f move_to_dir/$NEW_NAME
NEW_NAME = $NEW_NAME + incr $X
mv $i $NEW_NAME
done
#!/bin/bash
NEW_DIR=/tmp/new/
IFS="
"; for a in `find . -type f `
do
echo "$a"
new_name="`basename $a`"
while test -e "$NEW_DIR/$new_name"
do
new_name="${new_name}_"
done
cp "$a" "$NEW_DIR/$new_name"
done
I'd tend to do this in a simple script rather than try to fit in in a single command line.
For instance, in python, it would be relatively trivial to do a walk() through the directory, copying each mp3 file found to a different directory with an automatically incremented number.
If you want to get fancier, you could have a dictionary of existing file names, and simply append a number to the duplicates. (the index of the dictionary being the file name, and the value being the number of files found so far, which would become your suffix)
find /path/to/mp3s -name *.mp3 -exec mv \{\} /path/to/target/dir \;
At the risk of many downvotes, a perl script could be written in short time to accomplish this.
Pseudocode:
while (-e filename)
change filename to filename . "1";
In python: to actually move the file, change debug=False
import os, re
from_dir="/from/dir"
to_dir = "/target/dir"
re_ext = "\.mp3"
debug = True
w = os.walk(from_dir)
n = w.next()
while n:
d, arg, names = n
names = filter(lambda fn: re.match(".*(%s)$"%re_ext, fn, re.I) , names)
n = w.next()
for fn in names:
from_fn = os.path.join(d,fn)
target_fn = os.path.join(to_dir, fn)
file_exists = os.path.exists(target_fn)
if not debug:
if not file_exists:
os.rename(from_fn, target_fn)
else:
print "DO NOT MOVE - FILE EXISTS ", from_fn
else:
print "MOVE ", from_fn, " TO " , target_fn
Since you don't care how the duplicate files are named, utilize the 'backup' option on move:
find /path/to/mp3s -name *.mp3 -exec mv --backup=numbered {} /path/to/target/dir \;
Will get you:
song.mp3
song.mp3.~1~
song.mp3.~2~

Resources