Filtering out DNA repeats with vcftools

I am trying to filter repeats out of DNA sequence reads. For this, I have:
grw_vcf_filtered_2.vcf as the input file,
grw_repeatmasker_runner_combined.bed with the repeat positions I want to filter out,
grw_repeats_removed.vcf as the output file I want to generate.
However, after about 4 minutes of running I keep getting the error: grw_repeats_removed.vcf: No such file or directory.
Here is my code:
module load bioinfo-tools vcftools
cd $SNIC_TMP # work in the node-local scratch directory
cp /proj/snic2020-2-25/nobackup/violeta/grw_vcf_filtered_2.vcf /proj/snic2020-2-25/nobackup/violeta/grw_repeatmasker_runner_combined.bed ./
vcftools --vcf grw_vcf_filtered_2.vcf --out grw_repeats_removed.vcf --exclude-positions grw_repeatmasker_runner_combined.bed
# copy the result from scratch back to my project directory
cp ./grw_repeats_removed.vcf /proj/snic2020-2-25/nobackup/violeta/

vcftools is deprecated; use bcftools instead. The relevant option is:
-T, --targets-file [^]FILE    similar to -R, but streams the whole file rather than index-jumping; prefix the file with "^" to exclude the listed regions
Invoke it as:
bcftools view -O v -o grw_repeats_removed.vcf --targets-file ^grw_repeatmasker_runner_combined.bed grw_vcf_filtered_2.vcf
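If you would rather stay with vcftools, the original command likely fails for two reasons: --out takes an output prefix rather than a file name (vcftools appends .recode.vcf to it), and a BED file should be given to --exclude-bed, not --exclude-positions (which expects a two-column chromosome/position list); you also need --recode for a new VCF to be written at all. A sketch of the corrected job, under those assumptions:

module load bioinfo-tools vcftools
cd $SNIC_TMP
cp /proj/snic2020-2-25/nobackup/violeta/grw_vcf_filtered_2.vcf /proj/snic2020-2-25/nobackup/violeta/grw_repeatmasker_runner_combined.bed ./
# --recode writes <prefix>.recode.vcf; --recode-INFO-all keeps the INFO fields
vcftools --vcf grw_vcf_filtered_2.vcf --exclude-bed grw_repeatmasker_runner_combined.bed --recode --recode-INFO-all --out grw_repeats_removed
cp ./grw_repeats_removed.recode.vcf /proj/snic2020-2-25/nobackup/violeta/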

Related

Creating variables in running script

I'm trying to make some files read-only in a backup environment. Data Domain has a retention-lock feature that locks a file when an external trigger runs touch -a -t "dateuntillocked" /backup/foo.
There are also metadata files in the folder that must not be locked, otherwise the next backup job cannot update them and fails.
I extracted the metadata file names, but the file count can change. For example:
foo1.meta foo2.meta . . fooN.meta
Is it possible to create a variable for each entry and add them to the command dynamically?
Like:
var1=/backup/foo234.meta
var2=/backup/foo322.meta
.
.
varN=/backup/fooNNN.meta
<find command> | grep -v $var1 $var2....varN | while read line; do touch -a -t "$dateuntillocked" "$line"; done
To elaborate on the case: suppose you run ls in a folder, but the number of files differs over time. The script should create a variable for every file and use them all in a touch command inside a while loop: if there are 3 files in the folder, the script creates 3 variables and uses all 3 with touch; if ls finds 4 files, it dynamically creates 4 variables and uses them all; and so on. I am not a programmer, so my logic may be off; maybe there is an easier way to do this.
Just guessing what your intentions might be: you can usually collapse find | grep -v | command into a single find invocation by negating the name test, which skips the metadata files without any per-file variables:
find /backup -type f ! -name '*.meta' -exec touch -a -t "$dateuntillocked" {} +
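If the exclusion list really does have to be assembled at run time, a bash array is the idiomatic replacement for numbered variables like var1..varN. A minimal sketch, assuming the metadata files all match *.meta under /backup and that $dateuntillocked is already set:

#!/bin/bash
# collect the files to skip into an array instead of var1..varN
mapfile -t exclude < <(find /backup -name '*.meta')
# lock every other file, skipping anything held in the array
find /backup -type f -print0 | while IFS= read -r -d '' f; do
for m in "${exclude[@]}"; do [[ "$f" == "$m" ]] && continue 2; done
touch -a -t "$dateuntillocked" "$f"
done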

Read a file line by line and run a java program

I need to run a java program that merges multiple files with a *.bam extension. The structure of the command is:
java -jar mergefiles.jar \
I=file1.bam \
I=file2.bam \
I=file3.bam \
O=output.bam
So, I am trying to run this program for all *.bam files in a directory. Initially, I tried creating a list of the *.bam file names (filenames.txt):
file1.bam
file2.bam
file3.bam
and using a while loop, like:
while read -r line; do
java -jar MergeFiles.jar \
I=$line \
O=output.bam
done < filenames.txt
However, the program runs once per *.bam file in the text file rather than on all of them together (it merges only one file at a time and overwrites the output). So how can I run the program so that it merges all the *.bam files at once?
Also, is there another option in bash (e.g. a for loop) to solve this?
Thanks in advance.
In your question you specify that you would like to use all .bam files in a dir, so instead of creating a file with the filenames you should probably use globbing. Here's an example:
#! /bin/bash
# set nullglob to be safe
shopt -s nullglob
# read the filenames into an array
files=( *.bam )
# check that files actually exist
if (( ${#files[@]} == 0 )); then echo "no files" && exit 1; fi
# expand the array with a replacement
java -jar MergeFiles.jar \
"${files[#]/#/I=}" \
O=output.bam
The problem with your current solution is that the while loop will only read one line at a time, calling the command on each line separately.
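To see what that expansion does: "${files[@]/#/I=}" prepends I= to every element of the array, so with the three files above the command effectively becomes:

java -jar MergeFiles.jar I=file1.bam I=file2.bam I=file3.bam O=output.bam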

Need loop to delete parts of file name

I have been using an image optimizer for my websites, and it gives me files with -compressor at the end of the name.
input: filename.jpg
output: filename-compressor.jpg
I need help creating a batch file or command script: I want to be able to drop these files into a folder and have it loop through all of them and rename them, so that I don't have to go through them one by one.
mkdir -p compressors
mv *-compressor.jpg compressors/
cd compressors
for i in *-compressor.jpg; do j="${i%-compressor.jpg}.jpg"; mv "$i" "$j"; done
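The ${i%pattern} expansion strips the shortest matching suffix from the value, so for example:

i=filename-compressor.jpg
j="${i%-compressor.jpg}.jpg"   # j is now filename.jpg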

Loop to unzip from one directory to individual directories

I am trying to design a loop that combines a lot of single elements I have seen before, and the combination is throwing me off. Basically I have a structure like this:
/bin/script.sh
/conf/patches/1.zip
/conf/patches/2.zip
/conf/patches/3.zip
/sharedir
I want a loop that will go through however many patches I have in /conf/patches and unzip each patch into a separate directory in /sharedir. Each directory should be named after the file.
What I was trying so far was:
for file in '../conf/patches/*.zip'
do
unzip "${file%%.zip}" -d /sharedir/$file
done
As you can see... there is definitely something I am missing in this combination.
Try this:
for file in /conf/patches/*.zip
do
f="${file##*/}"
mkdir -p "/sharedir/${f%.zip}"
unzip -d "/sharedir/${f%.zip}" "${file}"
done
Remove the quotes from the glob pattern, otherwise it is not expanded:
for file in ../conf/patches/*.zip
do
unzip "${file%%.zip}" -d /sharedir/
done
EDIT: You can try
for f in ../conf/patches/*.zip; do
g="${f##*/}"
unzip -d "/sharedir/${g%.zip}" "$f"
done
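With the structure from the question, both the first loop and the EDIT above should leave you with one directory per archive, named after the zip file:

/sharedir/1/
/sharedir/2/
/sharedir/3/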

bash for command

#!/bin/bash
for i in /home/xxx/sge_jobs_output/split_rCEU_results/*.rCEU.bed
do
intersectBed -a /home/xxx/sge_jobs_output/split_rCEU_results/$i.rCEU.bed -b /home/xxx/sge_jobs_output/split_NA12878_results/$i.NA12878.bed -f 0.90 -r > $i.overlap_90.bed
done
However, I get errors like:
Error: can't determine file type of '/home/xug/sge_jobs_output/split_NA12878_results//home/xug/sge_jobs_output/split_rCEU_results/chr4.rCEU.bed.NA12878.bed': No such file or directory
It seems the two paths get concatenated, and I don't know why.
Thanks.
Your i has the format /home/xxx/sge_jobs_output/split_rCEU_results/whatever.rCEU.bed, and you insert it into another path, which leads to the duplication. It's probably simplest to switch to the directory and use basename, like this:
pushd /home/xxx/sge_jobs_output/split_rCEU_results
for i in *.rCEU.bed
do
intersectBed -a $i -b ../../sge_jobs_output/split_NA12878_results/`basename $i .rCEU.bed`.NA12878.bed -f 0.90 -r > `basename $i .rCEU.bed`.overlap_90.bed
done
popd
Notice the use of basename, with which you can strip the extension of a file: if you have a file called filename.foo.bar, basename filename.foo.bar .foo.bar returns just filename.
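A pure-bash alternative that avoids the basename subshells, using the same suffix-stripping expansion shown earlier (a sketch against the question's paths):

pushd /home/xxx/sge_jobs_output/split_rCEU_results
for i in *.rCEU.bed; do
stem="${i%.rCEU.bed}"
intersectBed -a "$i" -b "../split_NA12878_results/${stem}.NA12878.bed" -f 0.90 -r > "${stem}.overlap_90.bed"
done
popd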
