Using bash functions in snakemake - bash

I am trying to download some files with snakemake. The files (http://snpeff.sourceforge.net/SnpSift.html#dbNSFP) I would like to download are on a google site/drive and my usual wget approach does not work. I found a bash function that does the job (https://www.zachpfeffer.com/single-post/wget-a-Google-Drive-file):
function gdrive_download () {
  CONFIRM=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=$1" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')
  wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$CONFIRM&id=$1" -O $2
  rm -rf /tmp/cookies.txt
}
gdrive_download 120aPYqveqPx6jtssMEnLoqY0kCgVdR2fgMpb8FhFNHo test.txt
I have tested this function with my ids in a plain bash script and was able to download all the files. To add a bit to the complexity, I must use a workplace template, and incorporate the function into it.
rule dl:
    params:
        url = 'ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_{genome}/{afile}'
    output:
        'data/{genome}/{afile}'
    params:
        id1 = '0B7Ms5xMSFMYlOTV5RllpRjNHU2s',
        f1 = 'dbNSFP.txt.gz'
    shell:
        """CONFIRM=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id={{params.id1}}" -O- | sed -rn "s/.*confirm=([0-9A-Za-z_]+).*/\1\n/p") && wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$CONFIRM&id={{params.id1}}" -O {{params.f1}} && rm -rf /tmp/cookies.txt"""
        #'wget -c {params.url} -O {output}'
rule checksum:
    input:
        i = 'data/{genome}/{afile}'
    output:
        o = temp('tmp/{genome}/{afile}.md5')
    shell:
        'md5sum {input} > {output}'
rule file_size:
    input:
        i = 'data/{genome}/{afile}'
    output:
        o = temp('tmp/{genome}/{afile}.size')
    shell:
        'du -csh --apparent-size {input} > {output}'
rule file_info:
    """md5 checksum and file size"""
    input:
        md5 = 'tmp/{genome}/{afile}.md5',
        s = 'tmp/{genome}/{afile}.size'
    output:
        o = temp('tmp/{genome}/info/{afile}.csv')
    run:
        with open(input.md5) as f:
            md5, fp = f.readline().strip().split()
        with open(input.s) as f:
            size = f.readline().split()[0]
        with open(output.o, 'w') as fout:
            print('filepath,size,md5', file=fout)
            print(f"{fp},{size},{md5}", file=fout)
rule manifest:
    input:
        expand('tmp/{genome}/info/{suffix}.csv', genome=('GRCh37','GRCh38'), suffix=('dbNSFP.txt.gz', 'dbNSFP.txt.gz.tbi'))
        #expand('tmp/{genome}/info/SnpSift{suffix}.csv', genome=('GRCh37','GRCh38'), suffix=('dbNSFP.txt.gz', 'dbNSFP.txt.gz.tbi'))
    output:
        o = 'MANIFEST.csv'
    run:
        pd.concat([pd.read_csv(afile) for afile in input]).to_csv(output.o, index=False)
There are four downloadable files for which I have ids (I only show one in params), however I don't know how to call the bash functions as written by ZPfeffer for all the ids I have with snakemake. Additionally, when I run this script, there are several errors, the most pressing being
sed: -e expression #1, char 31: unterminated `s' command
I am far from a Snakemake expert. Any help with modifying my script to a) call the function with four different ids, b) fix the sed error, and c) verify whether this is the correct URL format (currently url = 'https://docs.google.com/uc?export/{afile}') would be greatly appreciated.

Use a raw string literal so that Python does not process escape sequences, such as the backslashes in the sed command, before the command reaches the shell. For example (notice the r in front of the shell command):
rule foo:
    shell:
        r"sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p'"
You could use --printshellcmds or -p to see how exactly shell: commands get resolved by snakemake.
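To see why the raw string matters here, the following minimal sketch compares the sed program from the question written as a normal Python string and as a raw string. It uses plain Python only, with no Snakemake involved; the variable names are illustrative.

```python
# The sed program from the question, once as a normal string and once as a raw string.
normal = "s/.*confirm=([0-9A-Za-z_]+).*/\1\n/p"
raw = r"s/.*confirm=([0-9A-Za-z_]+).*/\1\n/p"

# In the normal string Python converts \1 into the control character chr(1)
# and \n into a real newline before any shell or sed sees the text.
print(repr(normal))
print(repr(raw))

# The raw string keeps both backslashes, so it is two characters longer.
print(len(raw) - len(normal))
```

A literal newline in the middle of the replacement would plausibly explain the "unterminated `s' command" message. Separately, doubled braces are the escape for literal braces in Snakemake shell strings, so `{{params.id1}}` is likely passed through as `{params.id1}` instead of being substituted; single braces are probably what was intended.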

Here is how I "solved" it:
import pandas as pd
rule dl:
    output:
        'data/{genome}/{afile}'
    shell:
        "sh download_snpsift.sh"
rule checksum:
    input:
        i = 'data/{genome}/{afile}'
    output:
        o = temp('tmp/{genome}/{afile}.md5')
    shell:
        'md5sum {input} > {output}'
rule file_size:
    input:
        i = 'data/{genome}/{afile}'
    output:
        o = temp('tmp/{genome}/{afile}.size')
    shell:
        'du -csh --apparent-size {input} > {output}'
rule file_info:
    """md5 checksum and file size"""
    input:
        md5 = 'tmp/{genome}/{afile}.md5',
        s = 'tmp/{genome}/{afile}.size'
    output:
        o = temp('tmp/{genome}/info/{afile}.csv')
    run:
        with open(input.md5) as f:
            md5, fp = f.readline().strip().split()
        with open(input.s) as f:
            size = f.readline().split()[0]
        with open(output.o, 'w') as fout:
            print('filepath,size,md5', file=fout)
            print(f"{fp},{size},{md5}", file=fout)
rule manifest:
    input:
        expand('tmp/{genome}/info/{suffix}.csv', genome=('GRCh37','GRCh38'), suffix=('dbNSFP.txt.gz', 'dbNSFP.txt.gz.tbi'))
    output:
        o = 'MANIFEST.csv'
    run:
        pd.concat([pd.read_csv(afile) for afile in input]).to_csv(output.o, index=False)
And here is the bash script.
function gdrive_download () {
  CONFIRM=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=$1" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')
  wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$CONFIRM&id=$1" -O $2
  rm -rf /tmp/cookies.txt
}
gdrive_download 0B7Ms5xMSFMYlSTY5dDJjcHVRZ3M data/GRCh37/dbNSFP.txt.gz
gdrive_download 0B7Ms5xMSFMYlOTV5RllpRjNHU2s data/GRCh37/dbNSFP.txt.gz.tbi
gdrive_download 0B7Ms5xMSFMYlbTZodjlGUDZnTGc data/GRCh38/dbNSFP.txt.gz
gdrive_download 0B7Ms5xMSFMYlNVBJdFA5cFZRYkE data/GRCh38/dbNSFP.txt.gz.tbi
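As an alternative to hard-coding four calls in the script, the ids could be looked up per output file. The sketch below is one possible approach, not part of the original solution: `DRIVE_IDS` and `drive_id` are hypothetical names, the ids are copied from the script above, and `SimpleNamespace` only stands in for the wildcards object that Snakemake passes to params functions.

```python
from types import SimpleNamespace

# Google Drive ids copied from the download script, keyed by (genome, afile).
DRIVE_IDS = {
    ("GRCh37", "dbNSFP.txt.gz"): "0B7Ms5xMSFMYlSTY5dDJjcHVRZ3M",
    ("GRCh37", "dbNSFP.txt.gz.tbi"): "0B7Ms5xMSFMYlOTV5RllpRjNHU2s",
    ("GRCh38", "dbNSFP.txt.gz"): "0B7Ms5xMSFMYlbTZodjlGUDZnTGc",
    ("GRCh38", "dbNSFP.txt.gz.tbi"): "0B7Ms5xMSFMYlNVBJdFA5cFZRYkE",
}


def drive_id(wildcards):
    """Hypothetical helper: return the Drive id for one requested output file."""
    return DRIVE_IDS[(wildcards.genome, wildcards.afile)]


# Stand-in for the wildcards of the job producing data/GRCh38/dbNSFP.txt.gz.tbi.
wc = SimpleNamespace(genome="GRCh38", afile="dbNSFP.txt.gz.tbi")
print(drive_id(wc))
```

In a Snakefile this could be wired in as `params: id1 = drive_id` on the dl rule, so that each job downloads only the file named in its own output.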

Related

Qsub script - Unable to run job: Script length does not match declared length

I have a script that submits processing jobs to a queue. Before submitting, I assign string variables for each data point so I can pass them as arguments to qsub.
I first had to fix the module I'm loading by adding a -v variable to set up my working environment. After that I got the error message in the title, and there are very few resources for debugging it. One resource I found suggests the cause may be an extraneous space in the qsub command itself. Has anyone run into this?
I also echoed my qsub command to make sure it was being entered correctly, and it was.
Here's my script:
#!/bin/bash
# This script is for submitting the initial registration subjects for Greedy registration.
# It can serve as a template for later studies when multiple submissions could be handy
# GO_HOME = Origin directory for all niftis of interest
GO_NIFTI="/gpfs/fs001/medorg/comp_space/myname/Test-Retest/Nifti/"
GO_B0="/gpfs/fs001/medorg/comp_space/myname/Test-Retest/Protocols/ants_SyNBaseline/W_Registration_antsSyN_Baseline7/"
GO_FM="/gpfs/fs001/medorg/comp_space/myname/Test-Retest/Protocols/brainmage_batch_t1/"
FINAL_DESTINATION="/gpfs/fs001/cbica/comp_space/wingerti/Test-Retest/Protocols/Registration_greedy_Rigid/"
cd $GO_NIFTI
nii_directories=($(find . -type d -name "*t1*" -o -name "*t0*" -o -name "*t2*" -maxdepth 1 ))
module load greedy
# Will look at these subjects individually, taking them out of the list so DTI_Preprocess is not run on them
unset nii_directories[27] # 1000009_t0_test
unset nii_directories[17] # 1000001_t0
unset nii_directories[4] # 1000009_t2
# With directories, navigate into each, and find where the suitable niis are (31dir and 33dir)
for g in "${nii_directories[@]}";
do
# Subject ID argument
subjid=${g:2:9}
echo "$subjid is the subject ID..."
# -i argument (T1 NIFTI File and DTI)
cd $GO_NIFTI
nii_is=$(find $subjid -type f -name ${subjid}_T1.nii.gz)
nii_i=${GO_NIFTI}${nii_is}
cd $GO_B0
cd $subjid
GO_B0_2=$PWD
b0_is=$(find . -type f -name b0.nii.gz)
b0_i=${GO_B0_2}${b0_is}
echo "-i arguments for $subjid is $nii_i and $b0_i"
# -m argument (Mask File)
#cd $GO_DTI
#mask_ms=$(find $subjid -type f -name ${subjid}_tensor_mask.nii.gz)
#mask_m=${GO_DTI}${mask_ms}
#echo "-m argument for $subjid is $mask_m"
# -fm argument (T1 mask)
cd $GO_FM
mask_fms=$(find $subjid -type f -name ${subjid}_t1_brain_mask.nii.gz)
mask_fm=${GO_FM}${mask_fms}
echo "-fm argument for $subjid is $mask_fm"
# -o argument (Working Directory for possible debugging and tmp dir organization among experiments)
cd $FINAL_DESTINATION
g=${FINAL_DESTINATION:73:-1}
experiment_name="${subjid}_${g}"
mkdir $experiment_name
output_o=${FINAL_DESTINATION}${experiment_name}/${experiment_name}_rigid.txt
echo "-o argument for $g is $output_o"
#
printf "\nSubmitting the following command: \n
qsub -m beas -M myname@medschool.edu -N Registration_${experiment_name} "$(which greedy)" -d3 -i $nii_i $b0_i -o $output_o -a -m MI -n 100x100 -fm $mask_fm dof 6 -ia-identity\n
as JobID: Registration_${experiment_name}\n\n"
qsub -v /medorg/software/external/greedy/centos7/c6dca2e -m beas -M myname@medschool.edu -N Registration_${experiment_name} "$(which greedy)" -d3 -i $nii_i $b0_i -o $output_o -a -m MI -n 100x100 -fm $mask_fm dof 6 -ia-identity
# --- Above line submits Greedy Rigid jobs (dof 6) with
# --- "-m" for emailing updates on jobs, inbox sorts job submission emails
# --- "-N" names the job for book-keeping
cd $GO_NIFTI
done

Makefile: command line argument -e of echo is passed to the file

In a Makefile, when I write to a file using echo -e "text" >, the -e is also written to the file:
APIM_5 = echo -e "[Desktop Entry]\nName=$(MAIN)\nExec=$(MAIN)\nIcon=$(MAIN)\nType=Application\nVersion=1.0\nCategories=Utility;" > AppDir/usr/share/applications/$(MAIN).desktop;
But the file I echo into ($(MAIN).desktop) looks like below:
-e [Desktop Entry]
Name=main
Exec=main
Icon=main
Type=Application
Version=1.0
Categories=Utility;
All definitions together and how I call them:
APIM_1 = cd output;
APIM_2 = $(RM) AppDir appimage-build;
APIM_3 = mkdir -p AppDir/usr/bin AppDir/usr/share/applications AppDir/usr/share/icons/hicolor/256x256/apps/ AppDir/usr/lib;
APIM_4 = touch AppDir/usr/share/applications/$(MAIN).desktop;
APIM_5 = echo -e "[Desktop Entry]\nName=$(MAIN)\nExec=$(MAIN)\nIcon=$(MAIN)\nType=Application\nVersion=1.0\nCategories=Utility;" > AppDir/usr/share/applications/$(MAIN).desktop;
APIM_6 = cp $(MAIN) AppDir/usr/bin/;
APIM_7 = cp ../meta/icon/$(MAIN).png AppDir/usr/share/icons/hicolor/256x256/apps/;
APIM_8 = appimage-builder --skip-test;
appimage: all
	$(APIM_1) $(APIM_2) $(APIM_3) $(APIM_4) $(APIM_5) $(APIM_6) $(APIM_7) $(APIM_8)
	@echo Executing 'appimage' complete!
What causes this?

Multiple Processes - Python

I am looking to run multiple instances of a command line script at the same time. I am new to this concept of "multi-threading", so I am a bit at a loss as to why I am seeing the things that I am seeing.
I have tried to execute the sub-processing in two different ways:
1 - Using multiple calls of Popen without a communicate until the end:
command = 'raster2pgsql -I -C -e -s 26911 %s -t 100x100 -F p839.%s_image_sum_sum1 | psql -U david -d projects -h pg3' % (workspace + '\\r_sumsum1{}'.format(i), str(i))
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
command = 'raster2pgsql -I -C -e -s 26911 %s -t 100x100 -F p839.%s_image_sum_sum2 | psql -U david -d projects -h pg3' % (workspace + '\\r_sumsum2{}'.format(i), str(i))
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
command = 'raster2pgsql -I -C -e -s 26911 %s -t 100x100 -F p839.%s_image_sum_sum3 | psql -U david -d projects -h pg3' % (workspace + '\\r_sumsum3{}'.format(i), str(i))
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
(stdoutdata, stderrdata) = process.communicate()
This starts each of the commands but only completes the last one, leaving the other two hanging.
2 - Attempting to implement an example from "Python threading multiple bash subprocesses?", but nothing happens except for a printout of the commands (the program hangs with none of the commands running, as observed in Windows Task Manager):
import threading
import Queue
import commands
import time
workspace = r'F:\Processing\SM'
image = 't08r_e'
image_name = (image.split('.'))[0]
i = 0
process_image_tif = workspace + '\\{}{}.tif'.format((image.split('r'))[0], str(i))
# thread class to run a command
class ExampleThread(threading.Thread):
    def __init__(self, cmd, queue):
        threading.Thread.__init__(self)
        self.cmd = cmd
        self.queue = queue
    def run(self):
        # execute the command, queue the result
        (status, output) = commands.getstatusoutput(self.cmd)
        self.queue.put((self.cmd, output, status))
# queue where results are placed
result_queue = Queue.Queue()
# define the commands to be run in parallel, run them
cmds = ['raster2pgsql -I -C -e -s 26911 %s -t 100x100 -F p839.%s_image_sum_sum1 | psql -U david -d projects -h pg3' % (workspace + '\\r_sumsum1{}'.format(i), str(i)),
    'raster2pgsql -I -C -e -s 26911 %s -t 100x100 -F p839.%s_image_sum_sum2 | psql -U david -d projects -h pg3' % (workspace + '\\r_sumsum2{}'.format(i), str(i)),
    'raster2pgsql -I -C -e -s 26911 %s -t 100x100 -F p839.%s_image_sum_sum3 | psql -U david -d projects -h pg3' % (workspace + '\\r_sumsum3{}'.format(i), str(i)),
]
for cmd in cmds:
    thread = ExampleThread(cmd, result_queue)
    thread.start()
# print results as we get them
while threading.active_count() > 1 or not result_queue.empty():
    while not result_queue.empty():
        (cmd, output, status) = result_queue.get()
        print(cmd)
        print(output)
How can I run all of these commands at the same time and collect the results at the end? I am running on Windows with Python 2.7.
My first try didn't work because of the repeated definitions of stdout and stderr. Removing these definitions gives the expected behavior.
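If the output is still wanted, one option is to keep a handle to every process and call communicate() on each of them, instead of only on the last one. The sketch below is written for Python 3 (not the 2.7 used above) and uses placeholder commands that only print a label; in the real case these would be the three raster2pgsql pipelines.

```python
import subprocess
import sys

# Placeholder commands (assumption): each child just prints its own label.
commands = [
    [sys.executable, "-c", "print('sum1 done')"],
    [sys.executable, "-c", "print('sum2 done')"],
    [sys.executable, "-c", "print('sum3 done')"],
]

# Start every process first so they all run at the same time.
procs = [
    subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    for cmd in commands
]

# Then drain and wait on each one; communicate() reads both pipes to the end.
results = []
for proc in procs:
    out, err = proc.communicate()
    results.append((proc.returncode, out.strip(), err.strip()))

for returncode, out, err in results:
    print(returncode, out, err)
```

Reassigning a single `process` variable, as in the first attempt, leaves the earlier pipes unread, which is a plausible reason those children stalled once their output buffers filled.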

Replace a text and IP address inside a file

I would like to write a script that reads a text file that has all the nodes listed in there:
node1
node2
node3
.
.
.
1. It creates a .conf file for each node in the /etc/icinga2/zones.d/master/hosts/new/ directory.
2. It copies the content of the file named windows-template into each new conf file.
3. It then finds the phrase "hostname.hostdomain.com" in each conf file and replaces it with the filename minus the .conf. So for example, for node1, I will have node1.conf, in which the phrase "hostname.hostdomain.com" needs to be replaced with node1.
4. It then pings the hostname, which is the filename minus ".conf", and replaces 10.20.20.1 with the correct IP address.
I tried writing the script. Parts 1 and 2 work; part 3 works too, but it replaces hostname.hostdomain.com with the literal "$f", which is not right. And I have no clue how to do part 4.
Can you please help?
Thank you
This is my windows-template.conf file:
object Host "hostname.hostdomain.com" {
import "production-host"
check_command = "hostalive"
address = "10.20.20.1"
vars.client_endpoint = name
vars.disks["disk C:"] = {
disk_partition = "C:"
}
vars.os = "Windows"
}
object Zone "hostname.hostdomain.com" {
endpoints = [ "hostname.hostdomain.com" ];
parent = "master";
}
object Endpoint "hostname.hostdomain.com" {
host = "10.20.20.1"
}
And this is my script:
#!/bin/bash
cd /etc/icinga2/zones.d/master/hosts/new
while read f; do
cp -v "$f" /etc/icinga2/zones.d/master/hosts/new/"$f.conf"
cp windows-template.conf "$f.conf"
chown icinga:icinga "$f.conf"
sed -i 's/hostname.hostdomain.com/$f/g' "$f.conf"
# git add "$f.conf"
# git commit -m "Add $f"
done < windows-list.txt
Thank you
You need double quotes for the shell to expand your variable. Try
sed -i "s/hostname.hostdomain.com/$f/g" "$f.conf"
Does this work for you?
#!/bin/bash
cd /etc/icinga2/zones.d/master/hosts/new
while read f; do
cp -v "$f" /etc/icinga2/zones.d/master/hosts/new/"$f.conf"
cp windows-template.conf "$f.conf"
chown icinga:icinga "$f.conf"
sed -i "s/hostname.hostdomain.com/$f/g" "$f.conf"
hostname=$( ssh -o StrictHostKeyChecking=no "username@$f" -n "hostname" )
mv "$f.conf" "${hostname}.conf"
# git add "$f.conf"
# git commit -m "Add $f"
done < windows-list.txt
Here username is your username, and I assume you have copied your public key to the hosts.
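For part 4, a name lookup may be simpler than parsing ping output. The following is a minimal Python sketch under stated assumptions: `render_conf` is a hypothetical helper, the template is shortened to two lines, and the demo uses a fake resolver with a documentation-range address so it runs without DNS. With real hosts the default `socket.gethostbyname` would be used.

```python
import socket

# Shortened stand-in for windows-template.conf.
TEMPLATE = '''object Host "hostname.hostdomain.com" {
  address = "10.20.20.1"
}
'''


def render_conf(template, node, resolve=socket.gethostbyname):
    """Hypothetical helper: fill in the node name and its resolved IP address."""
    ip = resolve(node)  # name lookup instead of parsing ping output
    return template.replace("hostname.hostdomain.com", node).replace("10.20.20.1", ip)


# Fake resolver (assumption) so the example does not depend on the network.
conf = render_conf(TEMPLATE, "node1", resolve=lambda name: "192.0.2.10")
print(conf)
```

Each rendered string could then be written to the node's .conf file in place of the sed call.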

Use wildcard on params

I am trying to run a tool and need to use, in the command, a value that is tied to one of the input files.
This is an example:
aDict = {"120":"121" } #tumor : normal
rule all:
    input: expand("{case}.mutect2.vcf",case=aDict.keys())
def get_files_somatic(wildcards):
    case = wildcards.case
    control = aDict[wildcards.case]
    return [case + ".sorted.bam", control + ".sorted.bam"]
rule gatk_Mutect2:
    input:
        get_files_somatic,
    output:
        "{case}.mutect2.vcf"
    params:
        genome="ref/hg19.fa",
        target= "chr12",
        name_tumor='{case}'
    log:
        "logs/{case}.mutect2.log"
    threads: 8
    shell:
        " gatk-launch Mutect2 -R {params.genome} -I {input[0]} -tumor {params.name_tumor} -I {input[1]} -normal {wildcards.control}"
        " -L {params.target} -O {output}"
I get this error:
'Wildcards' object has no attribute 'control'
So I have a function that works out both case and control, but I am not able to access the control value in the shell command.
The wildcards are derived from the output file/pattern. That is why you only have the wildcard called case. You have to derive the control from that. Try replacing your shell statement with this:
    run:
        control = aDict[wildcards.case]
        shell(
            "gatk-launch Mutect2 -R {params.genome} -I {input[0]} "
            "-tumor {params.name_tumor} -I {input[1]} -normal {control} "
            "-L {input.target2} -O {output}"
        )
You could define control in params. Also, {input.target2} in the shell command would result in an error. Maybe it's supposed to be params.target?
rule gatk_Mutect2:
    input:
        get_files_somatic,
    output:
        "{case}.mutect2.vcf"
    params:
        genome="ref/hg19.fa",
        target= "chr12",
        name_tumor='{case}',
        control = lambda wildcards: aDict[wildcards.case]
    shell:
        """
        gatk-launch Mutect2 -R {params.genome} -I {input[0]} -tumor {params.name_tumor} \\
        -I {input[1]} -normal {params.control} -L {params.target} -O {output}
        """

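To illustrate why only `case` exists as a wildcard, here is a simplified sketch of matching an output pattern against a requested file name. `match_wildcards` is a hypothetical stand-in written for this example, not Snakemake's actual implementation; it only shows that names come from the `{...}` placeholders in the output pattern, and that the control has to be derived from them.

```python
import re

aDict = {"120": "121"}  # tumor : normal, as in the question


def match_wildcards(pattern, filename):
    """Simplified stand-in for wildcard matching (not Snakemake's real code)."""
    # Escape the pattern, then turn each escaped {name} into a named regex group.
    regex = re.sub(r"\\\{(\w+)\\\}", r"(?P<\1>.+)", re.escape(pattern))
    m = re.fullmatch(regex, filename)
    return m.groupdict() if m else None


# The output pattern only contains {case}, so that is the only wildcard available.
wildcards = match_wildcards("{case}.mutect2.vcf", "120.mutect2.vcf")

# The control is not a wildcard; it is looked up from the case.
control = aDict[wildcards["case"]]
print(wildcards, control)
```

This is the same lookup that the `control = lambda wildcards: aDict[wildcards.case]` entry in params performs for each job.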
Resources