Using R from the bash command line - bash

I have a set of *.txt files in a specific directory. I have written an R script called SampleStatus.r, which contains a single function that reads a file, processes the data, and writes the results to an output file.
The function is like:
format_windpro(import_file="in.txt", export_file="out.txt")
I would like to use a bash command to run my R function on every file in the directory in one go.

Use Rscript. Example code:
for f in "${INPUT_DIR}"/*.txt; do
    base=$(basename "$f")
    Rscript SampleStatus.R "$f" "${OUTPUT_DIR}/${base}"
done
In your SampleStatus.R, handle the command-line arguments like this:
#!/usr/bin/env Rscript
# ...
argv <- commandArgs(trailingOnly = TRUE)
# error checking...
import_file <- argv[1]
export_file <- argv[2]
# your function call
format_windpro(import_file, export_file)
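If the files are independent of one another, the same per-file call can also be driven by GNU parallel instead of a shell loop. A minimal sketch, assuming SampleStatus.R takes the input and output paths as its two arguments and that INPUT_DIR and OUTPUT_DIR are set as above:
parallel Rscript SampleStatus.R {} "${OUTPUT_DIR}"/{/} ::: "${INPUT_DIR}"/*.txt
Here {} is each input file and {/} is its basename, so each result lands in OUTPUT_DIR under the same name.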

Related

Syntax conflict for "{" using Nextflow

New to Nextflow, I attempted to run a loop in a Nextflow script block to remove the extension from sequence file names, and I am running into a syntax error.
params.rename = "sequences/*.fastq.gz"

workflow {
    rename_ch = Channel.fromPath(params.rename)
    RENAME(rename_ch)
    RENAME.out.view()
}

process RENAME {
    input:
    path read

    output:
    stdout

    script:
    """
    for file in $baseDir/sequences/*.fastq.gz;
    do
        mv -- '$file' '${file%%.fastq.gz}'
    done
    """
}
Error:
- cause: Unexpected input: '{' # line 25, column 16.
process RENAME {
^
Tried to use other methods such as basename, but to no avail.
Inside a script block, you just need to escape the Bash dollar-variables and use double quotes so that they can expand. For example:
params.rename = "sequences/*.fastq.gz"

workflow {
    RENAME()
}

process RENAME {
    debug true

    """
    for fastq in ${baseDir}/sequences/*.fastq.gz;
    do
        echo mv -- "\$fastq" "\${fastq%%.fastq.gz}"
    done
    """
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [crazy_brown] DSL2 - revision: 71ada7b0d5
executor > local (1)
[71/4321e6] process > RENAME [100%] 1 of 1 ✔
mv -- /path/to/sequences/A.fastq.gz /path/to/sequences/A
mv -- /path/to/sequences/B.fastq.gz /path/to/sequences/B
mv -- /path/to/sequences/C.fastq.gz /path/to/sequences/C
Also, if you find escaping the Bash variables tedious, you may want to consider using a shell block instead.
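A rough sketch of what a shell block could look like for the same process (in a shell block the body is single-quoted, Bash variables keep their plain $ form, and Nextflow variables are referenced with !{...}):
process RENAME {
    debug true

    shell:
    '''
    for fastq in !{baseDir}/sequences/*.fastq.gz;
    do
        echo mv -- "$fastq" "${fastq%%.fastq.gz}"
    done
    '''
}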

Nextflow: Missing output file(s) expected by process

I'm currently making a start on using Nextflow to develop a bioinformatics pipeline. Below, I've created a params.files variable which contains my FASTQ files, and then input this into the fasta_files channel.
The process trimming and its script take this channel as the input, and then ideally, I would output all of the $sample".trimmed.fq.gz files into the output channel, trimmed_channel. However, when I run this script, I get the following error:
Missing output file(s) `trimmed_files` expected by process `trimming` (1)
The nextflow script I'm trying to run is:
#! /usr/bin/env nextflow

params.files = files("$baseDir/FASTQ/*.fastq.gz")
println "fastq files for trimming:$params.files"
fasta_files = Channel.fromPath(params.files)
println "files in the fasta channel: $fasta_files"

process trimming {
    input:
    file fasta_file from fasta_files

    output:
    path trimmed_files into trimmed_channel

    // the shell script to be run:
    """
    #!/usr/bin/env bash
    mkdir trimming_report
    cd /home/usr/Nextflow

    #Finding and renaming my FASTQ files
    for file in FASTQ/*.fastq.gz; do
        [ -f "\$file" ] || continue
        name=\$(echo "\$file" | awk -F'[/]' '{ print \$2 }')    #renaming fastq files.
        sample=\$(echo "\$name" | awk -F'[.]' '{ print \$1 }')  #renaming fastq files.
        echo "Found" "\$name" "from:" "\$sample"
        if [ ! -e FASTQ/"\$sample"_trimmed.fq.gz ]; then
            trim_galore -j 8 "\$file" -o FASTQ    #trim the files
            mv "\$file"_trimming_report.txt trimming_report    #moves to the directory trimming_report
        else
            echo ""\$sample".trimmed.fq.gz exists skipping trim galore"
        fi
    done
    trimmed_files="FASTQ/*_trimmed.fq.gz"
    echo \$trimmed_files
    """
}
The script in the process works fine. However, I'm wondering if I'm misunderstanding or missing something obvious. If I've forgotten to include something, please let me know; any help is appreciated!
Nextflow does not export the variable trimmed_files to its own scope unless you tell it to do so using the env output qualifier; however, doing it that way would not be very idiomatic.
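For completeness only, I believe that non-idiomatic route would be declared in the output block roughly like this (DSL 1 syntax, shown purely as a sketch):
env trimmed_files into trimmed_channel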
Since you know the pattern of your output files ("FASTQ/*_trimmed.fq.gz"), simply pass that pattern as output:
path "FASTQ/*_trimmed.fq.gz" into trimmed_channel
Some things you do, but probably want to avoid:
Changing directory inside your NF process: don't do this; it entirely breaks the concept of Nextflow's /work folder setup.
Writing a bash loop inside a NF process: if you set up your channels correctly, there should only be one task per spawned process.
Pallie has already provided some sound advice and, of course, the right answer, which is: environment variables must be declared using the env qualifier.
However, given your script definition, I think there might be some misunderstanding about how best to skip the execution of previously generated results. The cache directive is enabled by default and when the pipeline is launched with the -resume option, additional attempts to execute a process using the same set of inputs, will cause the process execution to be skipped and will produce the stored data as the actual results.
This example uses Nextflow DSL 2 for convenience, but that is not strictly required:
nextflow.enable.dsl=2

params.fastq_files = "${baseDir}/FASTQ/*.fastq.gz"
params.publish_dir = "./results"

process trim_galore {
    tag { "${sample}:${fastq_file}" }

    publishDir "${params.publish_dir}/TrimGalore", saveAs: { fn ->
        fn.endsWith('.txt') ? "trimming_reports/${fn}" : fn
    }

    cpus 8

    input:
    tuple val(sample), path(fastq_file)

    output:
    tuple val(sample), path('*_trimmed.fq.gz'), emit: trimmed_fastq_files
    path "${fastq_file}_trimming_report.txt", emit: trimming_report

    """
    trim_galore \\
        -j ${task.cpus} \\
        "${fastq_file}"
    """
}

workflow {
    Channel.fromPath( params.fastq_files )
        | map { tuple( it.getSimpleName(), it ) }
        | set { sample_fastq_files }

    results = trim_galore( sample_fastq_files )

    results.trimmed_fastq_files.view()
}
Run using:
nextflow run script.nf \
-ansi-log false \
--fastq_files '/home/usr/Nextflow/FASTQ/*.fastq.gz'
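And to actually benefit from the cache described above on later runs, re-launch the same command with the -resume option, for example:
nextflow run script.nf \
    -resume \
    -ansi-log false \
    --fastq_files '/home/usr/Nextflow/FASTQ/*.fastq.gz'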

How to parallel process a function, with loops

So I have this function, and I want everything it contains to run at the same time. So far it isn't working, and according to other sources this is how you do it. The function itself works when it's not run in parallel.
#!/bin/bash
foo () {
    cd ${HOME}/sh/path/to/script/execute
    for f in *.sh; do
        # goes to the "execute" directory and executes all scripts in the
        # current directory "execute", basically run-parts without cron
        cd ~/sh/path/to/script
        while IFS= read -r l1    # Line 1 in master.txt
              IFS= read -r l2    # Line 2 in master.txt
              IFS= read -r l3    # Line 3 in master.txt
        do
            cd /dev/shm/arb
            echo ${l1} > arg.txt & echo ${l2} > arg2.txt & echo ${l3} > arg3.txt
            cd ${HOME}/sh/path/to/script/execute
            bash -H ${f}    # executes all scripts inside the "execute" folder
            cd ~/sh/path/to/script/here
            ./here.sh &
            cd ~/sh/path/to/script &
        done <master.txt
    done
}
export -f foo
parallel ::: foo
Results in
# No result at all..., it just buffers. htop doesn't acknowledge any
# processes, and when this runs it's pretty taxing on the cores.
master.txt content
In case this is relevant:
apple_fruit
apple_veggie
veggie_fruit
#apple changes
pear_fruit
pear_veggie
veggie_fruit
#pear changes
cucumber_fruit
...
I'm very new to using parallel and don't know how it works in advanced (or even basic) situations, so would the loops interfere? And if they do interfere, is there a workaround?
The result is probably going to be something like:
inner() {
    script="$1"
    parallel -N3 "'$script' {}; here.sh {}" :::: master.txt
}
export -f inner
parallel inner ::: ${HOME}/sh/path/to/script/execute/*.sh
This will call each of the scripts in ${HOME}/sh/path/to/script/execute/ (and here.sh) with 3 arguments from master.txt like this:
${HOME}/sh/path/to/script/execute/script1.sh apple_fruit apple_veggie veggie_fruit
You need to change the scripts so that:
They get the arguments from the command line (not from arg.txt, arg2.txt, arg3.txt).
They send their output to stdout
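As a minimal sketch (the echo body is just a placeholder), each script in execute/ would then begin something like this:
#!/bin/bash
# the three values that used to be written to arg.txt, arg2.txt and arg3.txt
# now arrive as positional parameters from parallel
l1=$1
l2=$2
l3=$3

# do the real work here and write the results to stdout
echo "processing: $l1 $l2 $l3"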

How can I write multiple lines in expect program for the spawn command?

I have written this little script for getting multiple files from my remote server to my host computer:
#! /usr/bin/expect -f
spawn scp \
user#remote:/home/user/{A.txt,B.txt} \
/home/user_local/Documents
expect "password: "
send "somesecretpwd\r"
interact
This works fine, but when I try to break the file list across lines like this:
user#remote:/home/user/{A.txt,\
B.txt} \
I am getting the following error(s):
scp: /home/user/{A.txt,: No such file or directory
scp: B.txt}: No such file or directory
I tried this:
user#remote:"/home/user/{A.txt,\
B.txt}" \
getting:
bash: -c: line 0: unexpected EOF while looking for matching `"'
bash: -c: line 1: syntax error: unexpected end of file
cp: cannot stat 'B.txt}"': No such file or directory
or this:
"user#remote:/home/user/{A.txt,\
B.txt}" \
getting the same error at the beginning.
How can I write the files across multiple lines so that the program still works correctly? I need this for better readability of the chosen files.
Edit:
Only changed the local user name to user_local
In Tcl (and so Expect), \<NEWLINE><SPACEs> will be converted into one single <SPACE>, so you cannot split a string that must contain no spaces across multiple lines.
% puts "abc\
def"
abc def
% puts {abc\
def}
abc def
%
Assuming the filenames are really long (not much point otherwise), you could use a couple of variables like this:
#! /usr/bin/expect -f
set A A.txt
set B B.txt
spawn scp \
user#remote:/home/user/{$A,$B} \
/home/user/Documents
expect "password: "
send "somesecretpwd"
interact
For anyone who wants to solve a similar problem using only expect:
You can write a list of files and then concatenate them all into one string.
Here is the code:
#! /usr/bin/expect -f

# a list of files
set files {
    A.txt
    B.txt
    C.txt
}

# join the list into one comma-separated string;
# in this example it would be: A.txt,B.txt,C.txt
set concat [join $files ,]

# hand-rolled version of the join above:
# set concat [lindex $files 0]                  ;# get the first file
# set last_idx [expr {[llength $files]-1}]      ;# calc the last index of the list
# set rest_files [lrange $files 1 $last_idx]    ;# get the other files
# foreach file $rest_files {
#     set concat $concat,$file                  ;# append a comma and the next file
# }
# puts "$concat"                                ;# only for testing the output

spawn scp \
    user#remote:/home/doublepmcl/{$concat} \
    /home/user_local/Documents
expect "password: "
send "somesecretpwd\r"
interact

Bash Scripting : How to loop over X number of files, take input and write to a file in the same line

So I have a program written in C, called allcell, that takes in some parameters:
some sample parameters: -m 1800 -n 9
the files being analyzed: cfdT100-0.trj, cfdT100-1.trj, cfdT100-2.trj, cfdT100-3.trj, ... cfdT100-19.trj
file being fed: template.file
out file: result.file
$ allcell -m 1800 -n 9 cfdT100-[0-19].trj < template.file > result.file
But when I look at htop, I see that only cfdT100-0.trj, cfdT100-1.trj and cfdT100-9.trj are being read. How do I make the shell read all of the files from 0 to 19?
Additionally, when I write a script file to automate this, how should I enclose the line? Will this work:
"$($ allcell -m 1800 -n 9 cfdT100-[0-19].trj < template.file > result.file)"
I believe you want to change your glob expression to cfdT100-{0..19}.trj instead.
neech#nicolaw.uk:~ $ echo cfdT100-{0..19}.trj
cfdT100-0.trj cfdT100-1.trj cfdT100-2.trj cfdT100-3.trj cfdT100-4.trj cfdT100-5.trj cfdT100-6.trj cfdT100-7.trj cfdT100-8.trj cfdT100-9.trj cfdT100-10.trj cfdT100-11.trj cfdT100-12.trj cfdT100-13.trj cfdT100-14.trj cfdT100-15.trj cfdT100-16.trj cfdT100-17.trj cfdT100-18.trj cfdT100-19.trj
Your quoting on the scripted version looks acceptable. Just change the glob.
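Applied to the original command line, that would be (assuming allcell is happy to receive all twenty .trj files as arguments):
allcell -m 1800 -n 9 cfdT100-{0..19}.trj < template.file > result.file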
Use a recursive function for an infinite loop:
a()
{
    echo "apple"
    a
}
a
This will make an infinite loop.
