For loop within a for loop for iterating files of different extensions - shell

Say I have 20 different files. First 10 files end with .counts.tsv and the rest of the files end with .libsize.tsv. For each .counts.tsv there are matching .libsize.tsv files. I would like to use a for loop for selecting both of these files and run an R script for on those two files types.
Here is what I tried,
#!/bin/bash
arti='/home/path/tofiles'
for counts in ${arti}/*__counts.tsv ; do
for libsize in "$arti"/*__libsize.tsv ; do
Rscript score.R ${counts} ${libsize}
done;
done;
The above shell script iterates over the files more than 200 times whereas I have only 20 files. I need the Rscript to be executed 10 times for both files. Any suggestions would be appreciated.

I started typing up an answer before seeing your comment that you're only interested in a bash solution, posting anyway in case someone finds this question in the future and is open to an R based solution.
If I were approaching this from scratch, I'd probably just use an R function defined in the file that takes the two file names instead of messing around with the system() calls, but this would provide the behavior you desire.
## Get a vector of files matching each extension
counts_names <- list.files(path = ".", pattern ="*.counts.tsv")
libsize_names <- list.files(path = ".", pattern ="*.libsize.tsv")
## Get the root names of the files before the extensions
counts_roots <- gsub(".counts.tsv$", "",counts_names)
libsize_roots <- gsub(".libsize.tsv$", "",libsize_names)
## Get only root names that have both file types
shared_roots <- intersect(libsize_roots,counts_roots)
## Loop through the shared root names and execute an Rscript call based on the two files
for(i in seq_along(shared_roots)){
counts_filename <- paste0(shared_roots[[i]],".counts.tsv")
libsize_filename <- paste0(shared_roots[[i]],".libsize.tsv")
Command <- paste("Rscript score.R",counts_filename,libsize_filename)
system(Command)
}

Construct the second filename with ${counts%counts.tsv} (remove last part).
#!/bin/bash
arti='/home/path/tofiles'
for counts in ${arti}/*__counts.tsv ; do
libsize="${counts%counts.tsv}libsize.tsv"
Rscript score.R "${counts}" "${libsize}"
done
EDIT:
Less safe is trying to make it an oneliner. When the filenames are without spaces and newlines, you can risk an accident with
echo ${arti}/*counts.tsv ${arti}/*.libsize.tsv | xargs -n2 Rscript score.R
and when you feel really lucky (with no other files than those tsv files in $arti) make a bungee jump with
echo ${arti}/* | xargs -n2 Rscript score.R

Have you tried list.files in base? This will allow you to use all files in the folder.
arti='/home/path/tofiles'
for i in list.files(arti) {
script
}

See whether the below helps.
my_list = list.files("./Data")
counts = grep("counts.tsv", my_list, value=T)
libsize = grep("libsize.tsv", my_list, value=T)
for (i in seq(length(counts))){
system(paste("Rscript score.R",counts[i],libsize[i]))
}

Finally,
I tried the following and it helped me,
for sam in "$arti"/*__counts.tsv ; do
filebase=$(basename $sam)
samples=$(ls -1 ${filebase}|awk -F'[-1]' '{print $1}')
Rscript score.R ${samples}__counts.tsv ${samples}__libsize.tsv
done;
For someone looking for something similar :)

Related

Bash - Read Directory Path From TXT, Append Executable, Then Execute

I am setting up a directory structure with many different R & bash scripts in it. They all will be referencing files and folders. Instead of hardcoding the paths I would like to have a text file where each script can search for a descriptor in the file (see below) and read the relevant path from that.
Getting the search-append to work in R is easy enough for me; I am having trouble getting it to work in Bash, since I don't know the language very well.
My guess is it has something to do with the way awk works / stores the variable, or maybe the way the / works on the awk output. But I'm not familiar enough with it and would really appreciate any help
Text File "Master_File.txt":
NOT_DIRECTORY "/file/paths/Fake"
JOB_TEST_DIRECTORY "/file/paths/Real"
ALSO_NOT_DIRECTORY "/file/paths/Fake"
Bash Script:
#! /bin/bash
master_file_name="Master_File.txt"
R_SCRIPT="RScript.R"
SRCPATH=$(awk '/JOB_TEST_DIRECTORY/ { print $2 }' $master_file_name)
Rscript --vanilla $SRCPATH/$R_SCRIPT
The last line, $SRCPATH/$R_SCRIPT, seems to be replacing part of SRCPath with the name of $R_SCRIPT which outputs something like /RScript.Rs/Real instead of what I would like, which is /file/paths/Real/RScript.R.
Note: if I hard code the path path="/file/paths/Real" then the code $path/$R_SCRIPT outputs what I want.
The R Script:
system(command = "echo \"SUCCESSFUL_RUN\"", intern = FALSE, wait = TRUE)
q("no")
Please let me know if there's any other info that would be helpful, I added everything I could think of. And thank you.
Edit Upon Answer:
I found two solutions.
Solution 1 - By Mheni:
[ see his answer below ]
Solution 2 - My Adaptation of Mheni's Answer:
After seeing a Mehni's note on ignoring the " quotation marks, I looked up some more stuff, and found out it's possible to change the character that awk used to determine where to separate the text. By adding a -F\" to the awk call, it successfully separates based on the " character.
The following works
#!/bin/bash
master_file_name="Master_File.txt"
R_SCRIPT="RScript.R"
SRCPATH=$(awk -F\" -v r_script=$R_SCRIPT '/JOB_TEST_DIRECTORY/ { print $2 }' $master_file_name)
Rscript --vanilla $SRCPATH/$R_SCRIPT
Thank you so much everyone that took the time to help me out. I really appreciate it.
the problem is because of the quotes around the path, this change to the awk command ignores them when printing the path.
there was also a space in the shebang line that shouldn't be there as #david mentioned
#!/bin/bash
master_file_name="/tmp/data"
R_SCRIPT="RScript.R"
SRCPATH=$(awk '/JOB_TEST_DIRECTORY/ { if(NR==2) { gsub("\"",""); print $2 } }' "$master_file_name")
echo "$SRCPATH/$R_SCRIPT"
OUTPUT
[1] "Hello World!"
in my example the paths are in /tmp/data
NOT_DIRECTORY "/tmp/file/paths/Fake"
JOB_TEST_DIRECTORY "/tmp/file/paths/Real"
ALSO_NOT_DIRECTORY "/tmp/file/paths/Fake"
and in the path that corresponds to JOB_TEST_DIRECTORY i have a simple hello_world R script
[user#host tmp]$ cat /tmp/file/paths/Real/RScript.R
print("Hello World!")
I would use
Master_File.txt :
NOT_DIRECTORY="/file/paths/Fake"
JOB_TEST_DIRECTORY="/file/paths/Real"
ALSO_NOT_DIRECTORY="/file/paths/Fake"
Bash Script:
#!/bin/bash
R_SCRIPT="RScript.R"
if [[ -r /path/to/Master_File.txt ]]; then
. /path/to/Master_File.txt
else
echo "ERROR -- Can't read Master_File"
exit
fi
Rscript --vanilla $JOB_TEST_DIRECTORY/$R_SCRIPT
Basically, you create a configuration file Key=value, source it then use the the keys as variable for whatever you need throughout the script.

One-line program to delete files with few header lines

This is the next part of my earlier question perl one-liner to keep only desired lines. Here I have many *.fa files in a folder.
Suppose for three files: 1.fa, 2.fa, 3.fa
The contents of them are as follows:
1.fa
>djhnk_9
abfgdddcfdafaf
ygdugidg
>kjvk.80
jdsfkdbfdkfadf
>jnck_q2
fdgsdfjghsjhsfddf
>7ytiu98
ihdlfwdfjdlfl]ol
2.fa
>cj76
dkjfhkdjcfhdjk
>67q32
nscvsdkvklsflplsad
>kbvbk
cbjfdikjbfadkjfbka
3.fa
>1290.5
mnzmnvjbsdjb
The lines that start with a > are the headers and the rest are the feature lines.
I want to delete those files that have 3 or fewer header lines. Here, file 2.fa and file 3.fa should be deleted.
As I am working on a Windows system, preferably I use a one-line Perl script like:
for %%F in ("*.fa") do perl ...
Is there a one-line program for that?
Use a program. "One-liners" are inscrutable, non-portable, and very hard to debug
This does as you ask. I hope it's clear that I have commented out the unlink call for testing purposes: it would be a pain to regenerate the *.fa files each time
You will probably want to change '[0-9].fa' to just *.fa. I had other files in my own directory that I didn't want to be considered
use strict;
use warnings 'all';
while ( my $file = glob '[0-9].fa' ) {
open my $fh, '<', $file;
my $headers = grep /^>/, <$fh>;
#unlink $file if $headers <= 3;
print qq{deleting "$file"\n} if $headers <= 3;
}
output
deleting "2.fa"
deleting "3.fa"
Next time, please try to write some code by yourself to solve the problem, and only after come ask for help. You will learn more if you do that, and we won't feel like you're just asking us to write your code.
The problem is very simple though, so here's a solution.
Note that this solution should be considered as a quick fix. Borodin suggested cleaner, easier to understand and more portable way to do this here.
I would suggest doing this with perl like this :
perl -nE "$count{$ARGV}++ if /^>/; END { unlink grep { $count{$_} <= 3 } keys %count }" *.fa
(for the record, I'm using double-quotes" as the delimiter of the string since you are on windows, but if anyone wish to use this on an unix system, just change the double-quotes " for some single-quotes').
Explanations:
-n surround the code with while(<>){...}, which will read the files one by one.
With $count{$ARGV}++ if /^>/ we count the number of headers in each file : $ARGV holds the name of the file being read, and /^>/ is true only if the line starts with >, ie. it's a header line.
Finally ( the END { .. } part), we delete (with the function unlink) the files that have 3 headers or less : keys %count gives all the file names, and grep { $count{$_} <= 3 } retains only the files that have 3 or less header lines to delete them.

Bash: Move multiple files with same name into same directory by renaming

I currently have a bunch of .txt files in separate folders and want to move them all into the same folder, except all the files have the same name. I would like to preserve all the files by adding some sort of number so that each one isn't overwritten, like FolderA/file.txt becomes NewFolder/file_1.txt, and FolderB/file.txt becomes NewFolder/file_2.txt, etc. Is there a clean way to do this using bash? Thanks in advance for your help!
You could do something like this (either in a script or right on the command line):
for i in A B C D E
do
mv Folder$i/file.txt NewFolder/file_$i.txt
done
It won't convert letters to numbers, but it does the basics of what you're looking for in a fairly simple fashion.
Hope this helps.
Following the previous answer, you can add two lines of code in bash to achieve your desired output:
declare -i n=1;
for i in A B C D E
do
mv Folder$i/file.txt NewFolder/file_$n.txt
n+=1
done

Call script on all file names starting with string in folder bash

I have a set of files I want to perform an action on in a folder that i'm hoping to write a scipt for. Each file starts with mazeFilex where x can vary from any number , is there a quick and easy way to perform an action on each file? e.g. I will be doing
cat mazeFile0.txt | ./maze_ppm 5 | convert - maze0.jpg
how can I select each file knowing the file will always start with mazeFile?
for fname in mazeFile*
do
base=${fname%.txt}
base=${base#mazeFile}
./maze_ppm 5 <"$fname" | convert - "maze${base}.jpg"
done
Notes
for fname in mazeFile*; do
This codes starts the loop. Written this way, it is safe for all filenames, whether they have spaces, tabs or whatever in their names.
base=${fname%.txt}; base=${base#mazeFile}
This removes the mazeFile prefix and .txt suffix to just leave the base name that we will use for the output file.
./maze_ppm 5 <"$fname" | convert - "maze${base}.jpg"
The output filename is constructed using base. Note also that cat was unnecessary and has been removed here.
for i in mazeFile*.txt ; do ./maze_ppm 5 <$i | convert - `basename maze${i:8} .txt`.jpg ; done
You can use a for loop to run through all the filenames.
#!/bin/bash
for fn in mazeFile*; do
echo "the next file is $fn"
# do something with file $fn
done
See answer here as well: Bash foreach loop
I see you want a backreference to the number in the mazeFile. Thus I recommend John1024's answer.
Edit: removes the unnecessary ls command, per #guido 's comment.

replace $1 variable in file with 1-10000

I want to create 1000s of this one file.
All I need to replace in the file is one var
kitename = $1
But i want to do that 1000s of times to create 1000s of diff files.
I'm sure it involves a loop.
people answering people is more effective than google search!
thx
I'm not really sure what you are asking here, but the following will create 1000 files named filename.n containing 1 line each which is "kite name = n" for n = 1 to n = 1000
for i in {1..1000}
do
echo "kitename = $i" > filename.$i
done
If you have mysql installed, it comes with a lovely command line util called "replace" which replaces files in place across any number of files. Too few people know about this, given it exists on most linux boxen everywhere. Syntax is easy:
replace SEARCH_STRING REPLACEMENT -- targetfiles*
If you MUST use sed for this... that's okay too :) The syntax is similar:
sed -i.bak s/SEARCH_STRING/REPLACEMENT/g targetfile.txt
So if you're just using numbers, you'd use something like:
for a in {1..1000}
do
cp inputFile.html outputFile-$a.html
replace kitename $a -- outputFile-$a.html
done
This will produce a bunch of files "outputFile-1.html" through "outputFile-1000.html", with the word "kitename" replaced by the relevant number, inside the file.
But, if you want to read your lines from a file rather than generate them by magic, you might want something more like this (we're not using "for a in cat file" since that splits on words, and I'm assuming here you'd have maybe multi-word replacement strings that you'd want to put in:
cat kitenames.txt | while read -r a
do
cp inputFile.html "outputFile-$a.html"
replace kitename "$a" -- kitename-$a
done
This will produce a bunch of files like "outputFile-red kite.html" and "outputFile-kite with no string.html", which have the word "kitename" replaced by the relevant name, inside the file.

Resources