I am new to Snakemake. I want to run NanoFilt on the fastq files that are in one folder, and I want to do it with Snakemake because I need to build a pipeline. This is my Snakemake script:
rule NanoFilt:
    input:
        "data/samples/{sample}.fastq"
    output:
        "nanofilt_out.gz"
    shell:
        "gunzip -c {input} | NanoFilt -q 8 | gzip > {output}"
When I run it I get the following error message:
WildcardError in line 2:
Wildcards in input files cannot be determined from output files:
'sample'
EDIT
I tried searching the error message but still couldn't figure out how to make it work. Can anyone help me?
So I tried what people here suggested; the new script is this:
samples = ['fastqrunid4d89b52e7b9734bd797205037ef201a30be415c8293',
           'fastqrunid4d89b52e7b9734bd797205037ef201a30be415c8292',
           'fastqrunid4d89b52e7b9734bd797205037ef201a30be415c8291',
           'fastqrunid4d89b52e7b9734bd797205037ef201a30be415c8290']
rule all:
    input:
        [f"nanofilt_out_{sample}.gz" for sample in samples]

rule NanoFilt:
    input:
        "zipped/zipped{sample}.gz"
    output:
        "nanofilt_out_{sample}.gz"
    shell:
        "gunzip -c {input} | NanoFilt -q 8 | gzip > {output}"
but when I run this I get the following error message:
Error in rule NanoFilt:
Removing output files of failed job NanoFilt since they might be corrupted:
nanofilt_out_fastqrunid4d89b52e7b9734bd797205037ef201a30be415c8292.gz
jobid: 4
output: nanofilt_out_fastqrunid4d89b52e7b9734bd797205037ef201a30be415c8290.gz
shell:
gunzip -c zipped/zippedfastqrunid4d89b52e7b9734bd797205037ef201a30be415c8290.gz | NanoFilt -q 8 | gzip > nanofilt_out_fastqrunid4d89b52e7b9734bd797205037ef201a30be415c8290.gz
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Does anyone know how to fix this?
The 'idea' of Snakemake is that you specify which output you want (for instance through rule all), and Snakemake looks at all the rules you have defined and works out how to get to the desired output.
When you tell Snakemake you want nanofilt_out.gz as output, how does it know which sample to take? It doesn't. If it just took any of the possible sample files, we would lose the information about which file the output belongs to. To solve this, the output needs the same wildcard as the input:
rule NanoFilt:
    input:
        "data/samples/{sample}.fastq"
    output:
        "nanofilt_out_{sample}.gz"
    shell:
        "gunzip -c {input} | NanoFilt -q 8 | gzip > {output}"
This way Snakemake can make an output for every sample. You do still need to adjust the pipeline so that it specifies which outputs you want, maybe something like this:
samples = [1, 2, 3]

rule all:
    input:
        [f"nanofilt_out_{sample}.gz" for sample in samples]
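If you don't want to hard-code the sample list, Snakemake can also infer it from the files on disk with glob_wildcards. A minimal sketch, assuming the fastq files really do sit in data/samples/:

```
samples = glob_wildcards("data/samples/{sample}.fastq").sample

rule all:
    input:
        expand("nanofilt_out_{sample}.gz", sample=samples)
```

Here glob_wildcards matches the pattern against existing files and returns the values that {sample} took, so new fastq files are picked up automatically.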
I made it work. The working code looks like this.
rule NanoFilt:
    input:
        expand("zipped/zipped.gz", sample=samples)
    output:
        "filtered/nanofilt_out.gz"
    conda:
        "envs/nanoFilt.yaml"
    shell:
        "gunzip -c {input} | NanoFilt -q 6 | gzip > {output}"
I have a text file with two columns: the first column is the name to be saved as, and the second column is the url address to the resource.
10000899567110806314.jpg 'http://lifestyle.inquirer.net/files/2018/07/t0724cheekee-marcopolo_1-e1532358505274-620x298.jpg'
10001149035013559957.jpg 'https://www.politico.eu/wp-content/uploads/2018/07/GettyImages-1004567890.jpg'
10001268622353586394.jpg 'http://www.channelnewsasia.com/image/10549912/16x9/991/529/a7afd249388308118058689b0060a978/Zv/tour-de-france-5.jpg'
10001360495981714191.jpg 'https://media.breitbart.com/media/2018/07/Dan-Coats.jpg'
The file contains thousands of lines, so I wanted a quick way to download and rename these images.
I read multiple posts on SO and came up with this solution:
cat list.txt | xargs -n 1 -P 4 -d '\n' wget -O
This uses xargs to download in parallel. I want to use wget with the -O option to rename the downloaded file. When I run a single wget command, this works well. Example:
wget -O 10000899567110806314.jpg 'http://lifestyle.inquirer.net/files/2018/07/t0724cheekee-marcopolo_1-e1532358505274-620x298.jpg'
but when running the command with xargs to download in parallel, I get this error:
Try `wget --help' for more options.
wget: missing URL
Usage: wget [OPTION]... [URL]...
If I generate a file with just (single col) newline delimited urls and run the following command, it works great.
cat list.txt | xargs -n 1 -P 4 -d '\n' wget
But, I don't want to download the files first and then do the rename operation.
The error you are getting is because you are passing only one argument at a time (-n 1). To make it work you need to pass the two arguments (file name and URL) together, try this:
cat list.txt | xargs -n 2 -P 4 wget -O
To use the full line as an argument, as @PesaThe suggested, you could use the option -L 1, for example:
xargs < list.txt -P 4 -L 1 wget -O
From the man:
-L number
Call utility for every number non-empty lines read.
A line ending with a space continues to the next non-empty line.
If EOF is reached and fewer lines have been read than number then utility
will be called with the available lines. The -L and -n options are
mutually-exclusive; the last one given will be used.
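To see why -n 2 fixes things, here is a dry run with echo standing in for wget (the file names and URLs below are made-up examples):

```shell
# two whitespace-separated fields per line: output name, then URL
printf '%s\n' \
    "a.jpg http://example.com/1.jpg" \
    "b.jpg http://example.com/2.jpg" > list.txt

# -n 2 hands xargs two fields at a time, so each wget -O
# invocation receives both a file name and a URL
xargs -n 2 echo wget -O < list.txt
```

With -n 1 only the file name would be passed to each invocation, which is exactly the "wget: missing URL" failure above.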
I'm trying to filter a csv by a value in a specific column. My script currently looks like this:
function csv_grep {
    if [ $1 == "$SEARCH_TERM" ]
    then
        echo "$2"
    fi
}
export -f csv_grep
VALUES=$(csvtool namedcol col1,col2 dictionary.csv | csvtool call csv_grep -)
However when I run it, I get
/bin/bash: csv_grep: command not found
csv_grep: terminated with exit code 127
I've got version 1.4.2-1 installed so this bug report should not apply.
Any ideas what I'm doing wrong or better approaches to the task at hand?
I've got version 1.4.2-1 installed so this bug report should not apply.
Actually, it looks like you have hit exactly the problem described in that bug report, as we can verify with a simple test. Here's my test environment:
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.2 LTS"
# dpkg -l csvtool
||/ Name Version Architecture Description
+++-==============-============-============-=================================
ii csvtool 1.4.2-1 amd64 handy command line tool for handl
Let's create and export a function:
testfunction() {
    echo column one is "$1"
}
export -f testfunction
And verify that it was successfully exported:
$ bash -c 'testfunction one'
column one is one
Now let's try calling it with csvtool:
$ echo one,two,three | csvtool call testfunction -
/bin/bash: testfunction: command not found
testfunction: terminated with exit code 127
With the bug report in mind, let's take a look at /bin/sh:
$ ls -l /bin/sh
lrwxrwxrwx. 1 root root 4 Oct 3 09:58 /bin/sh -> dash
So /bin/sh is not /bin/bash. Let's fix that:
$ sudo ln -sf /bin/bash /bin/sh
$ ls -l /bin/sh
lrwxrwxrwx. 1 root root 9 Oct 3 10:05 /bin/sh -> /bin/bash
And try csvtool again:
$ echo one,two,three | csvtool call testfunction -
column one is one
Any ideas what I'm doing wrong or better approaches to the task at hand?
I would never try to process csv files in a shell script. I would probably reach for Python and the csv module.
In snakemake, what is the recommended way to use the shell() function to execute multiple commands?
You can call shell() multiple times within the run block of a rule (rules can specify run: rather than shell:):
rule processing_step:
    input:
        # [...]
    output:
        # [...]
    run:
        shell("somecommand {input} > tempfile")
        shell("othercommand tempfile {output}")
Otherwise, since the run block accepts Python code, you could build a list of commands as strings and iterate over them:
rule processing_step:
    input:
        # [...]
    output:
        # [...]
    run:
        commands = [
            "somecommand {input} > tempfile",
            "othercommand tempfile {output}"
        ]
        for c in commands:
            shell(c)
If you don't need Python code during the execution of the rule, you can use triple-quoted strings within a shell block, and write the commands as you would within a shell script. This is arguably the most readable for pure-shell rules:
rule processing_step:
    input:
        # [...]
    output:
        # [...]
    shell:
        """
        somecommand {input} > tempfile
        othercommand tempfile {output}
        """
If the shell commands depend on the success/failure of the preceding command, they can be joined with the usual shell script operators like || and &&:
rule processing_step:
    input:
        # [...]
    output:
        # [...]
    shell:
        "command_one && echo 'command_one worked' || echo 'command_one failed'"
Thought I would throw in this example. It may not be a direct answer to the question, but I came across it while searching for something similar: how to run multiple shell commands, with some of them run in a particular directory (for various reasons).
To keep things clean you could use a shell script.
Say I have a shell script scripts/run_somecommand.sh that does the following:
#!/usr/bin/env sh

# Resolve paths before changing directory; quote everything
# so paths with spaces survive.
input=$(realpath "$1")
output=$(realpath "$2")
log=$(realpath "$3")
sample="$4"

mkdir -p "data/analysis/${sample}"
cd "data/analysis/${sample}" || exit 1
somecommand --input "${input}" --output "${output}" 2> "${log}"
Then in your Snakemake rule you can do this
rule somerule:
    input:
        "data/{sample}.fastq"
    output:
        "data/analysis/{sample}/{sample}_somecommand.json"
    log:
        "logs/somecommand_{sample}.log"
    shell:
        "scripts/run_somecommand.sh {input} {output} {log} {wildcards.sample}"

(Note that inside a shell directive the wildcard has to be referenced as {wildcards.sample}; a bare {sample} is not defined there.)
Note: If you are working on a Mac and don't have realpath, you can install it with brew install coreutils; it's a super handy command.
I am trying to write a Makefile target that will output an error if the Go code is not correctly formatted. This is for a CI step. I am struggling with how to get it working in the Makefile. This solution works on the bash command line:
! gofmt -l . 2>&1 | read
But copying this into the Makefile:
ci-format:
	@echo "$(OK_COLOR)==> Checking formatting$(NO_COLOR)"
	@go fmt ./...
	@! gofmt -l . 2>&1 | read
I get the following error:
/bin/sh: 1: read: arg count
These days, I use golangci-lint, which includes gofmt checking as an option.
But if for some reason you want to do this yourself, the command I previously used for precisely that purpose is:
diff -u <(echo -n) <(gofmt -d ./)
See, for example, the .travis.yml files on one of my projects.
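One wrinkle when moving that command into a Makefile: the <(...) process substitution is a bash feature, while make runs recipes with /bin/sh by default. A minimal sketch of wiring it in (the target name is assumed):

```
SHELL := /bin/bash

ci-format:
	@diff -u <(echo -n) <(gofmt -d ./)
```

diff exits non-zero whenever gofmt -d produces output, which fails the make target, exactly what a CI step needs.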
I'm trying to run a scheduled cron job and email the output to a few users. However, I only want to e-mail the users if something new happened.
Essentially, this is what happens:
I run a Python script that checks the filename on an FTP server. If the filename is different, it downloads the file and starts parsing the information. The filename of the previously downloaded file is stored in last.txt, and if the script does indeed find a new file, it updates the filename in last.txt.
If the filename is the same, it stops processing and just outputs the file is the same.
Essentially, my thoughts were I could do something similar to:
cp last.txt temp.last.txt | python script.py --verbose > report.txt | diff last.txt temp.last.txt
That's where I got stuck, though. Essentially I want to diff the two files, and if they're the same, nothing happens. If they differ, though, I can e-mail the contents of report.txt to a couple of e-mail addresses via the mail command.
Hopefully I was detailed enough, thanks in advance!
First of all, there is no need for the pipes | in your code; you should issue each command separately, either separated by semicolons or written on separate lines of the script.
For the problem itself, one solution would be to redirect the output of diff to a report file like:
cp last.txt temp.last.txt
python script.py --verbose > report.txt
diff last.txt temp.last.txt > diffreport.txt
You can then check if the report file is empty or not as described here: http://www.cyberciti.biz/faq/linux-unix-script-check-if-file-empty-or-not/
Based on the result, you can send diffreport.txt and report.txt or just delete all of it.
Here is a quick example of how your cron job script could look:
#!/bin/bash

# Run the python script
cp last.txt temp.last.txt
python script.py --verbose > report.txt
diff last.txt temp.last.txt > diffreport.txt

# Check if the diff report is empty or not
if [ -s "diffreport.txt" ]
then
    # file is not empty: send a mail with the attachments,
    # e.g. by calling another script that takes care of this task
    :   # placeholder; bash requires at least one command in each branch
else
    # file is empty: clean up everything
    rm diffreport.txt report.txt temp.last.txt
fi
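The [ -s FILE ] test used above is the key primitive: it is true when the file exists and has a size greater than zero. A quick illustration (the file names here are throwaway examples):

```shell
printf '' > empty.txt                 # zero-byte file
printf 'changed\n' > diffreport.txt   # non-empty file

# -s is false for the empty file, true for the non-empty one
[ -s empty.txt ] || echo "empty.txt is empty"
[ -s diffreport.txt ] && echo "diffreport.txt has content"
```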