Snakemake and Pandas syntax: Getting sample-specific parameters from the sample table - bioinformatics

First of all, this could be a duplicate of Snakemake and pandas syntax. However, I'm still confused, so I'd like to explain again.
In Snakemake I have loaded a sample table with several columns. One of the columns is called 'Read1'; it contains sample-specific read lengths. I would like to get this value for every sample separately, as it may differ.
What I would expect to work is this:
rule mismatch_profile:
    input:
        rseqc_input_bam
    output:
        os.path.join(rseqc_dir, '{sample}.mismatch_profile.xls')
    conda:
        "../envs/rseqc.yaml"
    params:
        read_length = samples.loc['{sample}']['Read1']
    shell:
        '''
        #!/bin/bash
        mismatch_profile.py -i {input} -o {rseqc_dir}/{wildcards.sample} -l {params.read_length}
        '''
However, that does not work. For some reason I am not allowed to use {sample} inside standard Pandas syntax and I get this error:
KeyError in line 41 of /rst1/2017-0205_illuminaseq/scratch/swo-406/test_snakemake_full/rules/rseqc.smk:
'the label [{sample}] is not in the [index]'
I don't understand why this does not work. I read that I can also use lambda functions, but I don't really understand exactly how, since they still need {sample} as input.
Could anyone help me?

You could use a lambda function:
params:
    read_length = lambda wildcards: samples.loc[wildcards.sample, 'Read1']
Your version fails because samples.loc['{sample}']['Read1'] is evaluated by Python as soon as the Snakefile is parsed, so pandas looks up the literal string '{sample}' in the index, which is exactly the KeyError you see. A function defers the lookup: Snakemake calls it per job with the resolved wildcards object, whose sample attribute is the actual sample name.

Related

BWA-mem and sambamba read group line error

This is a two-part question:
help interpreting an error;
help with coding.
I'm trying to run bwa-mem and sambamba to align raw reads to a reference genome and to sort them by position. These are the commands I'm using:
bwa mem \
-K 100000000 -v 3 -t 6 -Y \
-R '@S200031047L1C001R002\S*[1-2]' \
/path/to/reference/GCF_009858895.2_ASM985889v3_genomic.fna \
/path/to/raw-fastq/'S\S[^_]*_L01_[0-9]+-[0-9]+'_1.fq.gz \
/path/to/raw-fastq/'S\S[^_]*_L01_[0-9]+-[0-9]+'_2.fq.gz | \
/path/to/genomics/sambamba-0.8.2 view -S -f bam \
/dev/stdin | \
/path/to/genomics/sambamba-0.8.2 sort \
/dev/stdin \
--out host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam
This is the error message I'm getting: [E::bwa_set_rg] the read group line is not started with @RG.
My sequences were generated with an MGI sequencer and the read groups are identified like this: @S200031047L1C001R0020000243/1, i.e., they don't begin with @RG. How can I specify to sambamba that my read groups start with @S and not @RG?
The commands written above are from a published pipeline I'm modifying for my own research. However, among several changes, I'm not confident about how the sample ID is defined, as stated in the last line of the code: --out host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam (I'm referring to ${SAMPLE}). Any insights?
Thank you very much!
1. Specifying read groups
Your read group string is not correctly formatted. It should look like
'@RG\tID:$ID\tSM:$SM\tLB:$LB\tPU:$PU\tPL:$PL', where the parts beginning with a $ sign should be replaced with the information specific to your sequencing run and sample. Not all of them are required for all purposes. See the read group documentation by the GATK team for an example.
A read group specification always begins with @RG. That's part of the SAM format. Sequencers do not produce read groups; I think you may be confusing them with fastq header lines. Entries in the read group string are separated by tabs, denoted with \t. Tags and their values are separated by :.
The difference between $ID (read group id) and $SM (sample id) is that sample is the individual or biological sample which may have been sequenced several times in different libraries ($LB). In the GATK documentation they combine flowcell and library into the read group id. Sample and library could make an intuitive read group id in small projects. If you are working on your own project that is not part of a larger sequencing effort, you can define the read groups as you like. If several people work in the same project, you should be consistent to avoid problems later.
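For example, a filled-in read group string for bwa mem might look like the sketch below. All of the tag values are hypothetical placeholders rather than values from your run; PL:DNBSEQ is the SAM-spec platform tag that covers MGI instruments.
# hypothetical read group values; substitute your own run/sample information
bwa mem \
    -R '@RG\tID:flowcell1.lib1\tSM:sample1\tLB:lib1\tPU:flowcell1.1.sample1\tPL:DNBSEQ' \
    /path/to/reference/GCF_009858895.2_ASM985889v3_genomic.fna \
    sample1_1.fq.gz sample1_2.fq.gz > sample1.sam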
2. Variable substitution
I'm not sure if I understood you correctly, but if you are wondering what ${SAMPLE} means in the command, it's a variable called SAMPLE that will be replaced by its value when the command is run. The curly brackets protect the name so that the shell does not confuse the variable name with characters coming after it. See here for examples.
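A minimal sketch, assuming a hypothetical sample name:
SAMPLE=sample01    # hypothetical; typically set from a loop over samples or a command-line argument
echo "host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam"
# prints: host_removal/sample01/sample01.hybrid.sorted.bam
# the braces mark where the variable name ends, so the shell does not
# try to look up a variable with the following characters glued on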

Passing calculation commands to cluster job

TL;DR
Trying to pass a computation of the form $(($LSB_JOBINDEX-1)) to a cluster call, but getting an error
$((2-1)): syntax error: operand expected (error token is "$((2-1))")
How do I escape this correctly, or what alternative command can I use so that this works?
Detailed:
To automate parts of my workflow, I am currently writing a script that automatically issues bsub commands in a predefined order.
Some of these commands are array jobs that are supposed to work on one file each.
If done without the cluster calls, it would look something like this:
samplearray=(sample0.fasta sample1.fasta) # array of input files
for s in "${samplearray[@]}"; do
    echo "$s" # some command on $s
done
For the cluster call I want to use an array job; the command for this looks like this:
bsub -J test[1-2] 'samplearray=(sample0.fastq sample1.fastq)' echo '${samplearray[$(($LSB_JOBINDEX-1))]}'
which launches two jobs with LSB_JOBINDEX set to 1 or 2 respectively, which is why I need to subtract 1 to index the array correctly.
The problem now is in the $((...)) part, because what is being executed on the node is ${samplearray[$\(\($LSB_JOBINDEX-1\)\)]} which does not trigger the computation but instead throws an error:
$((2-1)): syntax error: operand expected (error token is "$((2-1))")
What am I doing wrong here? I have tried other ways of escaping and quoting, but this was the closest I got to the correct solution
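One workaround to sketch (assuming, as is usual with LSF, that a submitted command line containing shell metacharacters is run through a shell on the execution host): pass the entire command line as one single-quoted string, so the array definition and the $((...)) arithmetic reach the job's shell intact instead of being re-escaped word by word during submission.
# the whole command is one single-quoted argument; the job's shell on the
# node evaluates the array index, so $((LSB_JOBINDEX-1)) survives intact
bsub -J 'test[1-2]' 'samplearray=(sample0.fastq sample1.fastq); echo ${samplearray[$((LSB_JOBINDEX-1))]}'
Alternatively, putting the body into a small script and submitting that script with bsub -J 'test[1-2]' sidesteps the quoting problem entirely.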

how to use mongoexport to get a particular format in .csv file?

I am new to MongoDB and excited about using it at my workplace. However, I have come across a situation where one of our clients has sent the data in a .bson file. I have got everything working on my machine. I want to use the mongoexport facility to export my data in CSV format. When I use the following query
./mongoexport --db <dbname> --collection <collectionname> --csv --fields _id,field1,field2
I am getting the result in following format
ObjectID(4f6b42eb5e724242f60002ce),"[ { ""$oid"" : ""4f6b31295e72422cc5000001"" } ]",369008
However, I just want the values of the fields as comma-separated output, like below: 4f6b42eb5e724242f60002ce,4f6b31295e72422cc5000001,369008
My question is: is there anything I can do in mongoexport to ignore certain characters?
Any pointer will be helpful.
No, mongoexport has no features like this. You'll need to use tools like sed and awk to post-process the file, or read the file and munge it in a scripting language like Python.
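For instance, here is a rough sed sketch matched to the sample line above; the file names are hypothetical, and the second pattern will likely need adjusting to your actual export:
# strip ObjectID(...) wrappers and unwrap the quoted { "$oid" : ... } field
sed -E -e 's/ObjectID\(([0-9a-f]+)\)/\1/g' \
       -e 's/"\[ [{] ""\$oid"" : ""([0-9a-f]+)"" [}] \]"/\1/g' \
       export.csv > cleaned.csv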
You should be able to add the following to your list of arguments:
--csv
You may also want to supply a path:
-o something.csv
...Though I don't think you could do this in 2012 when you first posted your question :-)
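Put together (database, collection, and field names here are placeholders):
./mongoexport --db mydb --collection mycoll --csv --fields _id,field1,field2 -o something.csv
Note that recent mongoexport versions replace --csv with --type=csv.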

Export data from mathematica script

I am writing a Mathematica script and running it in the Linux batch shell. The script produces a list of numbers as its result. I would like to write this list to a file as a single column, without the braces and commas. For this, I tried to use the Export command as
Export["file.txt", A1, "Table"]
but I get the error:
Export::infer: Cannot infer format of file test1.txt
I tried other formats but I got the same error.
Could someone please tell me what is wrong and what I can do? Thanks in advance.
From what I understand you are trying to export the file in "Table" format; why don't you try something like this:
Export["file.txt", A1, "Text"]
This:
A1 = {1, 2, 3};
Export["test.tab", Transpose[{A1}], "Table"];
produces a single column without braces and commas: Transpose[{A1}] turns the flat list into a list of one-element rows, and the "Table" format writes one row per line.

ls command in UNIX

I am using the ls command to get the details of certain types of files. The file names have a specific format: the first two words, followed by the date on which the file was generated.
e.g.:
Report_execution_032916.pdf
Report_execution_033016.pdf
The word Summary can also appear in place of Report.
e.g.:
Summary_execution_032916.pdf
Hence in my shell script I put these lines of code:
DATE=`date +%m%d%y`
Model=Report
file=`ls ${Model}_execution_*${DATE}_*.pdf`
But the value of Model always gets resolved to 'REPORT' and hence I get:
ls: cannot access REPORT_execution_*032916_*.pdf: No such file or directory
I am stuck at how the resolution of Model is happening here.
I can't reproduce the exact code here, hence I have changed some variable names. Initially I had used the variable name type instead of Model, but Model is the one I use in my actual code.
You've changed your script to use Model=Report and ${Model}, and you've said you have typeset -u Model in your script. The -u option to the typeset command (typeset and declare are synonyms) means "convert the strings assigned to all upper-case".
-u When the variable is assigned a value, all lower-case characters are converted to upper-case. The lower-case attribute is disabled.
That explains the upper-case REPORT in the variable expansion. You can demonstrate by writing:
Model=Report
echo "Model=[${Model}]"
It would echo Model=[REPORT] because of the typeset -u Model.
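You can confirm whether a variable carries that attribute with typeset -p; a quick check, with the output shown for bash:
typeset -u Model   # set somewhere earlier, e.g. in a sourced script
Model=Report
typeset -p Model   # in bash this prints: declare -u Model="REPORT"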
Don't use the -u option if you don't want it.
You should probably fix your glob expression too:
file=$(ls ${Model}_execution_*${DATE}*.pdf)
Using $(…) instead of backticks is generally a good idea.
And, as a general point, learn how to Debug a Bash Script and always provide an MCVE (How to create a Minimal, Complete, and Verifiable Example?) so that we can see what your problem is more easily.
Some things to look at:
type is the name of a shell builtin; although using it as a variable won't break your script, I suggest you change that variable name to something else.
You are missing a $ before {DATE}, and you have an extra _ after it. If the date is the last part of the name, then there's no point in having an * at the end either. The file definition should be:
file=`ls ${type}_execution_*${DATE}.pdf`
Try debugging your code by parts: instead of doing an ls, do an echo of each variable, see what comes out, and trace the problem back to its origin.
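For example, a sketch of that echo-based debugging:
echo "Model=${Model}"
echo "DATE=${DATE}"
echo ${Model}_execution_*${DATE}*.pdf   # lets the shell show what the glob expands to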
As @DevSolar pointed out, you may have problems parsing the output of ls.
As a workaround
ls | grep `date +%m%d%y` | grep "_execution_" | grep -E 'Report|Summary'
filters the ls output afterwards.
touch 'Summary_execution_032916.pdf'
DATE=`date +%m%d%y`
Model=Summary
file=`ls ${Model}_execution_*${DATE}*.pdf`
worked just fine on
GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)
Part of question:
But the value of Model always gets resolved to 'REPORT' and hence I get:
This is due to the fact that in your script you have exported Model=Report
Part of question:
ls: cannot access REPORT_execution_*032916_*.pdf: No such file or directory
The "No such file or directory" issue is due to the additional "_" and additional "*" that you have put in your 3rd line.
Remove them and the error will be gone. Model will still resolve to REPORT, though.
Original 3rd line :
file=`ls ${Model}_execution_*${DATE}_*.pdf`
Change it to
file=`ls ${Model}_execution_${DATE}.pdf`
The change above will resolve the "cannot access" issue.
Part of question
I am stuck at how the resolution of Model is happening here.
I am not sure what you are trying to achieve, but if you are trying to populate the file parameter with file names like anything_execution_someDate.pdf, then you can write your script as
DATE=`date +%m%d%y`
file=`ls *_execution_${DATE}.pdf`
If you echo the value of file you will get
Report_execution_032916.pdf Summary_execution_032916.pdf
as the answer
There were some other scripts that were invoked before control reaches the lines of code I mentioned in the question. In one such script there is this line:
typeset -u Model
This forces the value of the variable Model to upper-case, which is why this error was thrown:
ls: cannot access REPORT_execution_032916_.pdf: No such file or directory
I am sorry that I couldn't provide a minimal, complete and verifiable example.
