RNA-seq STAR alignment error in reading fastq files - bash

I am writing a script to use the STAR aligner to map fastq files to a reference genome. Here is my code:
#!/bin/bash
#$ -N DT_STAR
#$ -l mem_free=200G
#$ -pe openmp 8
#$ -q bio,abio,pub8i
module load STAR/2.5.2a
cd /dfs1/bio/dtatarak/DT-advancement_RNAseq_stuff/RNAseq_10_4_2017
mkdir David_data1
STAR --genomeDir /dfs1/bio/dtatarak/indexes/STAR_Index --readFilesIn /dfs1/bio/dtatarak/DT-advancement_RNAseq_stuff/RNAseq_10_4_2017/DT_1_read1.fastq
/dfs1/bio/dtatarak/DT-advancement_RNAseq_stuff/RNAseq_10_4_2017/DT_1_read2.fastq --runThreadN 8 --outFileNamePrefix "David_data1/DT_1"
I keep getting this error message:
EXITING because of fatal input ERROR: could not open readFilesIn=/dfs1/bio/dtatarak/DT-advancement_RNAseq_stuff/RNAseq_10_4_2017/DT_1_read1.fastq
Does anyone have experience using STAR? I cannot figure out why it isn't able to open my read files.

The second space character between STAR and --genomeDir is a syntax error. There should be only one.
Another thing is the argument --outFileNamePrefix "David_data1/DT_1"
Are you sure that it takes a path in quotes? Also, you have to create the directory DT_1 within David_data1 first, if you haven't already done so manually. And there always has to be a / in front of the paths.
--outFileNamePrefix /David_data1/DT_1/
Besides, are there any subdirectories in your STAR_Index folder? Because I always have to set the --genomeDir argument like this:
--genomeDir path/to/STAR_index/STARindex/hg38/
This message is known to come up after syntax errors, so I hope it works if you try something like this:
#!/bin/bash
#$ -N DT_STAR
#$ -l mem_free=200G
#$ -pe openmp 8
#$ -q bio,abio,pub8i
module load STAR/2.5.2a
cd /dfs1/bio/dtatarak/DT-advancement_RNAseq_stuff/RNAseq_10_4_2017
mkdir David_data1
cd David_data1
mkdir DT_1
cd ..
STAR --genomeDir /dfs1/bio/dtatarak/indexes/STAR_Index --readFilesIn /dfs1/bio/dtatarak/DT-advancement_RNAseq_stuff/RNAseq_10_4_2017/DT_1_read1.fastq \
/dfs1/bio/dtatarak/DT-advancement_RNAseq_stuff/RNAseq_10_4_2017/DT_1_read2.fastq --runThreadN 8 --outFileNamePrefix /David_data1/DT_1/
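If the error persists after these changes, it is worth checking from the same shell that both FASTQ paths really exist and are readable before STAR is called; a small sanity check along these lines (using the paths from the question) could go just above the STAR command:
for f in /dfs1/bio/dtatarak/DT-advancement_RNAseq_stuff/RNAseq_10_4_2017/DT_1_read1.fastq \
         /dfs1/bio/dtatarak/DT-advancement_RNAseq_stuff/RNAseq_10_4_2017/DT_1_read2.fastq
do
    # -r tests that the file exists and is readable by the job user
    [ -r "$f" ] || echo "Cannot read: $f"
done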

Related

Creating a job-array from text file for Bash

I'm trying to create a job array that runs simultaneously, taking each line from the text file "somemore.txt", which contains the paths of some files I want to run through the program "fastqc". This is the script:
#!/bin/bash
#$ -S /bin/bash
#$ -N QC
#$ -cwd
#$ -l h_vmem=24G
cd /emc/cbmr/users/czs772/
FILENAME=$(sed -n $SGE_TASK_ID"p" somemore.txt)
/home/czs772/FastQC/fastqc $FILENAME -outdir /emc/cbmr/users/czs772/marcQC
but I get the error: "No such file or directory"
Instead, if I run the code through a for loop, I get no error:
for name in $(cat /emc/cbmr/users/czs772/somemore.txt)
do /home/czs772/FastQC/fastqc $name -outdir /emc/cbmr/users/czs772/marcQC
done
So it makes me think that the mistake is in the script code and not in the directory, but I can't make it work. I've also tried to open the file with "cat", but again, it didn't work.
Any idea why?
Problem solved!
I typed "cat -vet" to see hidden characters:
cat -vet 2fastqc.sh
#!/bin/bash^M$
#$ -S /bin/bash^M$
#$ -N FastQC^M$
#$ -cwd^M$
#$ -pe smp 1^M$
#$ -l h_vmem=12G^M$
^M$
cd /emc/cbmr/users/czs772/^M$
FILENAME=$(sed -n $SGE_TASK_ID"p" somemore.txt)^M$
/home/czs772/FastQC/fastqc $FILENAME -outdir marcQC
This showed a "^M" at the end of each line. I just discovered this is something that may happen when writing scripts in Windows. It can be solved:
from the code editor program (I use Sublime Text): select the code, then View -> Line Endings -> Unix (instead of Windows)
from the server, by typing: dos2unix [name of the script.sh]
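For reference, the carriage returns can also be checked and stripped directly on the server; a quick sketch, assuming GNU grep and sed are available (dos2unix does the same job):
# Count lines that contain a carriage return (should be 0 after the fix)
grep -c $'\r' 2fastqc.sh
# Strip the trailing \r from every line in place (alternative to dos2unix)
sed -i 's/\r$//' 2fastqc.sh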
Thanks for your comments!

Creating separate output file per input file

I'm using kofamscan by KEGG to annotate a bunch of fasta files. I'm running it with multiple fasta files, so whenever a new file is analyzed the output file gets overwritten. I really want separate output files per input file (i.e. a.fasta -> a.txt; b.fasta -> b.txt, etc.) and I have tried the following, but it doesn't seem to work:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe def_slot 8
#$ -N coral_kofam
#$ -o stdout
#$ -e stderr
#$ -l os7
# perform kofam operation from file 1 to file 47
#$ -t 1-47:1
#$ -tc 10
#setting
source ~/.bash_profile
readarray -t files < kofam_files #input files
TASK_ID=$((SGE_TASK_ID - 1))
~/kofamscan/bin/exec_annotation -o kofam_out_[$TASK_ID].txt --tmp-dir $(mktemp -d) ${files[$TASK_ID]}
The following part of the code is what I need to change (obviously, as it is not working for me now):
-o kofam_out_[$TASK_ID].txt
Could anybody help me make this work?
Do you want to name the output file with $TASK_ID?
Just write the file name like this: kofam_out_${TASK_ID}.txt
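If you would rather have each output named after its input file (a.fasta -> a.txt, as described in the question), you can derive the name from the array entry instead of from the task id; a minimal sketch based on the original script (the variable names infile/outfile are just for illustration):
readarray -t files < kofam_files              # input files, one path per line
TASK_ID=$((SGE_TASK_ID - 1))
infile=${files[$TASK_ID]}
outfile=$(basename "${infile%.fasta}").txt    # a.fasta -> a.txt
~/kofamscan/bin/exec_annotation -o "$outfile" --tmp-dir "$(mktemp -d)" "$infile"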

How to save the output files in the corresponding folders

I have many allsamples.bam files in different folders, and I want to extract the unmapped reads from all of them and save them as unmapped.bam in the corresponding folders. How can I do that? allbamfiles.txt contains the paths to all my bam files.
#!/usr/bin/env bash
#$ -q cluster
#$ -cwd
#$ -N test
#$ -e /path/to/log
#$ -o /path/to/log
#$ -l job_mem=8G
#$ -pe serial 4
SAMTOOLS="/path/to/samtools"
while IFS= read -r file
do
$SAMTOOLS view -b -f 4 $file > "${file%.bam}_unmapped.bam"
done < "/path/to/allbamfiles.txt"
wait
Assuming that the paths in allbamfiles.txt are relative to the current directory or are absolute paths, this solution should work.
Notice that the dirname command gets the directory part of the path and the basename command gets the file name.
SAMTOOLS="/path/to/samtools"
while read file; do
dir=$(dirname $file)
fileName=$(basename $file)
$SAMTOOLS view -b -f 4 $file > "${dir}/${fileName%.bam}_unmapped.bam"
done < "/path/to/allbamfiles.txt"
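If you literally want every output to be called unmapped.bam inside the folder that holds the corresponding input (as the question describes, and assuming there is only one bam file per folder), the redirection inside the loop can simply drop the file name part:
$SAMTOOLS view -b -f 4 "$file" > "${dir}/unmapped.bam"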

Variable Not getting recognized in shell script

I use the following shell script to run a simulation on my cluster.
#PBS -N 0.05_0.05_m_1_200k
#PBS -l nodes=1:ppn=1,pmem=1000mb
#PBS -S /bin/bash
#$ -m n
#$ -j oe
FOLDER= 0.57
WDIR=/home/vikas/ala_1_free_energy/membrane_200k/restraint_decoupling_pullinit_$FOLDER
cd /home/vikas/ala_1_free_energy/membrane_200k/restraint_decoupling_pullinit_$FOLDER
LAMBDA= 0.05
/home/durba/gmx455/bin/mdrun -np 1 -deffnm md0.05 -v
############################
Now my problem is that my script doesn't recognize the variable FOLDER and throws an error
couldn't find md0.05.tpr
which exists in the folder. If I write 0.57 in place of $FOLDER, it works fine, which makes me think that it's not recognizing the variable FOLDER. LAMBDA is recognized perfectly in both cases. If somebody can help me here, I will be grateful.
There should not be a space between the = and the value you wish to assign to the variables:
FOLDER="0.57"
WDIR="/home/vikas/ala_1_free_energy/membrane_200k/restraint_decoupling_pullinit_$FOLDER"
cd "/home/vikas/ala_1_free_energy/membrane_200k/restraint_decoupling_pullinit_$FOLDER"
LAMBDA="0.05"
/home/durba/gmx455/bin/mdrun -np 1 -deffnm md0.05 -v
############################
The double quotes "" I added are not strictly necessary for this example; however, it is good practice to get into the habit of using them.
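To see why the space matters: with FOLDER= 0.57 the shell treats FOLDER= as a temporary, empty assignment and then tries to run 0.57 as a command, so FOLDER stays empty for the rest of the script and the pullinit_$FOLDER path loses its suffix. A small illustration (not part of the original script):
FOLDER= 0.57              # runs "0.57" as a command with FOLDER="" only for that command
echo "pullinit_$FOLDER"   # prints "pullinit_" -- the suffix is missing, so cd goes to the wrong place
FOLDER=0.57               # correct: no spaces around =
echo "pullinit_$FOLDER"   # prints "pullinit_0.57"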

Sun Grid Engine: name output file using value stored in variable

Thanks in advance for the help.
I am trying to pass a job using
qsub -q myQ myJob.sh
in myJob.sh I have
# Name of the output log file:
temp=$( date +"%s")
out="myPath"
out=$out$temp
#$ -v out
#$ -o $out
unset temp
unset out
What I want is for my output file to have a standard name with the Unix timestamp appended to the end, such as myOutputFile123456789.
When I run this, my output file is literally named "$out" rather than myOutputFile123456789. Is it possible to do what I want, and if so, how might I do it?
You can't set -o or -e programmatically inside the script. What you can do is point them at /dev/null and then redirect inside the script. Assuming you want the timestamp to be the time the job ran, and that the job script is a Bourne-style shell script (including bash, ksh, and zsh scripts), the following should work:
#$ -o /dev/null
exec >myPath$(date +"%s")
You'll be throwing away any output from the prolog/epilog though.
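Put together, a minimal job script using this approach might look like the sketch below; the -e /dev/null line and the 2>&1 are assumptions on my part, added only in case you also want stderr in the same timestamped file, and myPath is kept from the question:
#!/bin/bash
#$ -o /dev/null
#$ -e /dev/null
# Send everything this script prints to a timestamped file, e.g. myPath1234567890
exec > "myPath$(date +%s)" 2>&1
echo "job started on $(hostname)"
# ... rest of the job ...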
