snakemake manage pair sample and indelrealigner - bioinformatics

I want to connect the realigner target creation step to the indel realignment step.
This is the rule:
rule gatk_IndelRealigner:
    input:
        tumor="mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam",
        normal="mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam",
        id="mapped_reads/merged_samples/operation/{tumor}_{normal}.realign.intervals"
    output:
        "mapped_reads/merged_sample/CoClean/{tumor}.sorted.dup.reca.cleaned.bam",
        "mapped_reads/merged_sample/CoClean/{normal}.sorted.dup.reca.cleaned.bam",
    params:
        genome=config['reference']['genome_fasta'],
        mills= config['mills'],
        ph1_indels= config['know_phy'],
    log:
        "mapped_reads/merged_samples/logs/{tumor}.indel_realign_2.log"
    threads: 8
    shell:
        "gatk -T IndelRealigner -R {params.genome} "
        "-nt {threads} "
        "-I {input.tumor} -I {input.normal} -known {params.ph1_indels} -known {params.mills} -nWayOut .cleaned.bam --maxReadsInMemory 500000 --noOriginalAligmentTags --targetIntervals {input.id} >& {log} "
This is the error:
Not all output files of rule gatk_IndelRealigner contain the same wildcards.
I suppose I also need to use {tumor}_{normal} in the output names, but I can't work out how.
Snakemake:
rule all:
    input:
        expand("mapped_reads/merged_samples/CoClean/{sample}.sorted.dup.reca.cleaned.bam", sample=config['samples']),
        expand("mapped_reads/merged_samples/operation/{sample[1][tumor]}_{sample[1][normal]}.realign.intervals", sample=read_table(config["conditions"], ",").iterrows())
config.yml
conditions: "conditions.csv"
conditions.csv
tumor,normal
411,412
Here is a reduced example (for testing purposes) that gives the same error:
Directory:
$ tree prova/
prova/
├── condition.csv
├── config.yaml
├── output
│   ├── ABC.bam
│   ├── pippa.bam
│   ├── Pippo.bam
│   ├── TimBorn.bam
│   ├── TimNorm.bam
│   ├── TimTum.bam
│   └── XYZ.bam
└── Snakefile
This is the Snakefile:
$ cat prova/Snakefile
from pandas import read_table
configfile: "config.yaml"
rule all:
    input:
        expand("{pathDIR}/{sample[1][tumor]}_{sample[1][normal]}.bam", pathDIR=config["pathDIR"], sample=read_table(config["sampleFILE"], " ").iterrows()),
        expand("CoClean/{sample[1][tumor]}.bam", sample=read_table(config["sampleFILE"], " ").iterrows()),
        expand("CoClean/{sample[1][normal]}.bam", sample=read_table(config["sampleFILE"], " ").iterrows())
rule gatk_RealignerTargetCreator:
    input:
        "{pathGRTC}/{normal}.bam",
        "{pathGRTC}/{tumor}.bam",
    output:
        "{pathGRTC}/{tumor}_{normal}.bam"
    # wildcard_constraints:
    #     tumor = '[^_|-|\/][0-9a-zA-Z]*',
    #     normal = '[^_|-|\/][0-9a-zA-Z]*'
    run:
        call('touch ' + str(wildcard.tumor) + '_' + str(wildcard.normal) + '.bam', shell=True)
rule gatk_IndelRealigner:
    input:
        t1="output/{tumor}.bam",
        n1="output/{normal}.bam",
    output:
        "CoClean/{tumor}.sorted.dup.reca.cleaned.bam",
        "CoClean/{normal}.sorted.dup.reca.cleaned.bam",
    log:
        "mapped_reads/merged_samples/logs/{tumor}.indel_realign_2.log"
    threads: 8
    shell:
        "gatk -T IndelRealigner -R {params.genome} "
        "-nt {threads} -I {input.t1} -I {input.n1} & {log} "
conditions.csv
$ more condition.csv
tumor normal
TimTum TimBorn
XYZ ABC
Pippo pippa
Thanks for any suggestion

I'm not convinced you have to pass two input files to GATK IndelRealigner. Building on that assumption, you can change the rule so that it is indifferent to the type (tumor vs. normal) of the file it is processing. I read the specs here. Please, if I am wrong, stop reading and correct me.
rule gatk_IndelRealigner:
    input:
        inputBAM="output/{sampleGATKIR}.bam",
    output:
        "CoClean/{sampleGATKIR}.sorted.dup.reca.cleaned.bam",
    log:
        "mapped_reads/merged_samples/logs/{sampleGATKIR}.indel_realign_2.log"
    params:
        genome="**DONT FORGET TO ADD THIS**"
    threads: 8
    shell:
        "gatk -T IndelRealigner -R {params.genome} "
        "-nt {threads} -I {input.inputBAM} >& {log} "
By making the rule agnostic to the BAM type, you gain two advantages at the cost of one main disadvantage.
Advantages:
There is now only a single wildcard.
Each .bam file can be realigned independently, which, with a dedicated CPU each, should hopefully make things faster.
Disadvantage:
We are now likely loading two copies of the genome into memory, since the jobs run as separate processes and no longer share the genome file in memory. (In my previous position, hardware availability wasn't typically an issue, so I am heavily biased towards splitting everything up.)
I think the GATK documentation shows it accepting multiple BAM files because, for a one-off call, you want to list all the files at once. We don't need that, since we are automating the calls; we are indifferent to 1 call or 100 calls.
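As a rough sketch of the single-wildcard idea, the Python below shows two things: why the paired rule trips the "same wildcards" check, and how the tumor/normal table can be flattened into one target per sample for rule all. The helper names wildcard_names and targets_from_conditions and the {sample} wildcard are made up for illustration. The regular expression only handles plain {name} wildcards, and the standard csv module stands in for pandas.

```python
import csv
import io
import re


def wildcard_names(pattern):
    """Return the set of plain {name} wildcards used in an output pattern."""
    return set(re.findall(r"\{(\w+)\}", pattern))


def targets_from_conditions(handle, template, delimiter=" "):
    """Flatten tumor/normal pairs into one target path per sample."""
    targets = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        for role in ("tumor", "normal"):
            targets.append(template.format(sample=row[role]))
    return targets


# The two outputs of the paired rule use different wildcard sets,
# which is what the error message is complaining about.
paired_outputs = [
    "CoClean/{tumor}.sorted.dup.reca.cleaned.bam",
    "CoClean/{normal}.sorted.dup.reca.cleaned.bam",
]
print([sorted(wildcard_names(p)) for p in paired_outputs])

# With a single wildcard, every sample in the table becomes its own target.
conditions = io.StringIO("tumor normal\nTimTum TimBorn\nXYZ ABC\nPippo pippa\n")
targets = targets_from_conditions(
    conditions, "CoClean/{sample}.sorted.dup.reca.cleaned.bam"
)
print(targets)
```

In a Snakefile, a list like targets could be passed directly as the input of rule all, so the pairing in the conditions file no longer has to appear in any output name.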

Related

Windows tree command that follows .lnk links

I have a folder structure that I want to draw as a tree (I'm currently using the tree command). The current output looks like this:
C:.
└───example
    ├───example2
    └───folder with link
        │   link to example 2.lnk
        │
        └───other folder
Would it be possible to show the destination of the link?
Example:
C:.
└───example
    ├───example2
    └───folder with link
        │   link to example 2.lnk -> example2
        │
        └───other folder
It doesn't have to look exactly like that; I just want to see the link destination.
I searched the internet, but the only thing I found was a Linux solution that looks like this:
tree -l
.
├── aaa
│   └── bbb
│       └── ccc
└── slink -> /home/kaa/test/aaa/bbb
Sadly, -l or /l doesn't exist in Windows.

Windows Equivalent to `sha256sum -c` (cryptographic hash, digest file, recursive integrity check, SHA256SUMS)

What is the equivalent to sha256sum -c in Windows?
I have a set of very important files that I need to copy-to and mirror across many different types of disks in many geographically distinct locations. After relaying the contents to disk via USB, ethernet, fiber, radio, telegram, and signal fires (some of which are more reliable means of transmissions than others!), I want to check the integrity of the data written to disk.
In Debian Linux, file checksums are typically stored in a SHA256SUM "digest" file that's generated using the sha256sum command. It's trivial to use this command to generate this file with the recursive SHA256 checksums of all the files in the current directory and subdirectories. It's also very trivial for the user to use this command to verify the integrity of all the files, recursively. For example, consider this super-critical dataset of cat pictures
user@disp3274:~/Pictures$ tree
.
├── cats
│   ├── cat1.jpeg
│   ├── cat2.jpeg
│   └── cat3.jpeg
└── people
    ├── person1.jpeg
    └── person2.jpeg
2 directories, 5 files
user@disp3274:~/Pictures$
I can generate the checksum file as follows
user@disp3274:~/Pictures$ time sha256sum `find . -type f` > SHA256SUMS
real 0m0.010s
user 0m0.008s
sys 0m0.002s
user@disp3274:~/Pictures$
user@disp3274:~/Pictures$ cat SHA256SUMS
b2d82e7b8dcbaef4d06466bee3486c12467ce5882e2eabe735319a90606f206a ./people/person2.jpeg
e01f7b240f300ce629c07502639a670d9665e82df6cba9311b87ba3ad23c595d ./people/person1.jpeg
53e056cc91fd4157880fb746255a2f621ebee8ca6351a659130d6228142c1e47 ./cats/cat1.jpeg
a0a73a21b9d26f1bbe4fcfce0acd21964dedf2dc247a5fe99bd9f304aa137379 ./cats/cat2.jpeg
a171fa88d431a531960b6eb312d964ed66cc35afd64bde5dda9b929ad83343f6 ./cats/cat3.jpeg
user@disp3274:~/Pictures$
And I can verify the integrity of all the files as follows
user@disp3274:~/Pictures$ time sha256sum -c SHA256SUMS
./people/person2.jpeg: OK
./people/person1.jpeg: OK
./cats/cat1.jpeg: OK
./cats/cat2.jpeg: OK
./cats/cat3.jpeg: OK
real 0m0.009s
user 0m0.008s
sys 0m0.000s
user@disp3274:~/Pictures$
In Windows, what is the equivalent built-in tool for generating a SHA256SUMS (or similar digest file using another cryptographic hash function) and verifying the integrity of a set of files, recursively?
There is no direct equivalent of the sha256sum tool, but PowerShell can easily generate a (SHA256) hash of a file or files using the Get-FileHash cmdlet.
If you want to call Get-FileHash for all the files in a folder, you can combine it with Get-ChildItem, e.g. Get-ChildItem | Get-FileHash, or recursively: Get-ChildItem -Recurse | Get-FileHash
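Get-FileHash only produces the hashes; the comparison that sha256sum -c performs still has to be scripted. As a hedged illustration of what that check involves, here is a minimal cross-platform sketch in Python. It assumes Python is installed (it is not a built-in Windows tool), and the function names sha256_of, write_sums and check_sums are made up for this example. It reads and writes the same "hash, then path" line layout shown in the SHA256SUMS file above.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sums(root, sums_name="SHA256SUMS"):
    """Write one '<hash>  ./<relative path>' line per file under root."""
    root = Path(root)
    files = sorted(
        p for p in root.rglob("*") if p.is_file() and p.name != sums_name
    )
    lines = [
        f"{sha256_of(p)}  ./{p.relative_to(root).as_posix()}" for p in files
    ]
    (root / sums_name).write_text("\n".join(lines) + "\n", encoding="utf-8")


def check_sums(root, sums_name="SHA256SUMS"):
    """Return {listed path: True/False} for every entry in the digest file."""
    root = Path(root)
    results = {}
    text = (root / sums_name).read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with a leading '*'
        target = root / name
        # Compare case-insensitively: some tools print hashes in upper case.
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results
```

Calling write_sums("Pictures") once on the source disk and check_sums("Pictures") on each copy would report any file whose content changed or went missing; a missing file simply shows up as False rather than raising an error.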

How do I reach end of file in less terminal without ...skipping

I page tree output through less in the terminal with this function:
function tre() {
    tree -aC -I '.git|node_modules|bower_components' --dirsfirst "$@" | less -FRNX;
}
With it, the output scrolls one line per key press.
I need a shortcut or command to reach the end of the file.
If I press "G", the output contains "...skipping...":
19 │ │ │ └── someotherfile.db
20 │ │ ├── static
...skipping...
62 │ │ ├── user
63 │ │ │ ├── admin.py
How do I get to the end of file with all lines loaded without "...skipping..."?
The issue was with this:
less -FRNX;
The last flag (X) forced the output to be drawn line by line, so the solution was to drop it:
less -FRN;
(Why I use less for tree output)
The difference between the default tree output and the output piped through less: same folder, but through less the output has colors, line numbers, and directories first.

How to view full dependency tree for nested Go dependencies

I'm trying to debug the following build error in our CI where "A depends on B which can't build because it depends on C." I'm building my data service which doesn't directly depend on kafkaAvailMonitor.go which makes this error hard to trace. In other words:
data (what I'm building) depends on (?) which depends on
kafkaAvailMonitor.go
It may seem trivial for a developer to fix (they just run "go get whatever"), but I can't do that as part of the release process: I have to find the person who added the dependency and ask them to fix it.
I'm aware that there are tools to visualize the dependency tree and other more sophisticated build systems, but this seems like a pretty basic issue: is there any way I can view the full dependency tree to see what's causing the build issue?
go build -a -v
../../../msgq/kafkaAvailMonitor.go:8:2: cannot find package
"github.com/Shopify/sarama/tz/breaker" in any of:
/usr/lib/go-1.6/src/github.com/Shopify/sarama/tz/breaker (from $GOROOT)
/home/jenkins/go/src/github.com/Shopify/sarama/tz/breaker (from $GOPATH)
/home/jenkins/vendor-library/src/github.com/Shopify/sarama/tz/breaker
/home/jenkins/go/src/github.com/Shopify/sarama/tz/breaker
/home/jenkins/vendor-library/src/github.com/Shopify/sarama/tz/breaker
When using modules you may be able to get what you need from go mod graph.
usage: go mod graph
Graph prints the module requirement graph (with replacements applied)
in text form. Each line in the output has two space-separated fields: a module
and one of its requirements. Each module is identified as a string of the form
path@version, except for the main module, which has no @version suffix.
I.e., for the original question, run go mod graph | grep github.com/Shopify/sarama then look more closely at each entry on the left-hand side.
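To go one step beyond grep, the pairs printed by go mod graph can be walked to recover the whole chain from the main module down to the module you care about. The sketch below is only an illustration: parse_mod_graph and find_chain are made-up helper names, and the sample graph uses hypothetical modules and versions rather than real output. It relies only on the documented format, one "module requirement" pair per line with versions written as path@version.

```python
from collections import deque


def parse_mod_graph(text):
    """Build {module: [requirements]} from `go mod graph` style output."""
    graph = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        parent, child = parts
        graph.setdefault(parent, []).append(child)
    return graph


def find_chain(graph, root, wanted):
    """Breadth-first search: shortest chain from root to a module whose
    path (the part before '@') starts with `wanted`, or None."""
    queue = deque([[root]])
    seen = {root}
    while queue:
        chain = queue.popleft()
        node = chain[-1]
        if node != root and node.split("@")[0].startswith(wanted):
            return chain
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(chain + [child])
    return None


# Hypothetical graph; real input would come from running `go mod graph`.
sample = """\
example.com/data example.com/msgq@v0.1.0
example.com/data example.com/log@v1.0.0
example.com/msgq@v0.1.0 github.com/Shopify/sarama@v1.0.0
"""
graph = parse_mod_graph(sample)
chain = find_chain(graph, "example.com/data", "github.com/Shopify/sarama")
print(" -> ".join(chain))
```

Because the search is breadth-first, the chain returned is the shortest one; other, longer routes to the same module may exist in a real graph.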
"If the following isn't a stack trace, what is it?"
It is the list of paths where Go is looking for your missing package.
"I have no idea who is importing kafkaAvailMonitor.go"
It is not "imported", just part of your sources and compiled.
Except it cannot compile, because it needs github.com/Shopify/sarama/tz/breaker, which is not in GOROOT or GOPATH.
Still, check what go list returns for your direct package, to see if kafkaAvailMonitor is mentioned.
go list can show either the packages that your package directly depends on or its complete set of transitive dependencies.
% go list -f '{{ .Imports }}' github.com/davecheney/profile
[io/ioutil log os os/signal path/filepath runtime runtime/pprof]
% go list -f '{{ .Deps }}' github.com/davecheney/profile
[bufio bytes errors fmt io io/ioutil log math os os/signal path/filepath reflect run
You can then script go list in order to list all dependencies.
See this bash script for instance, by Noel Cower (nilium)
#!/usr/bin/env bash
# Usage: lsdep [PACKAGE...]
#
# Example (list github.com/foo/bar and package dir deps [the . argument])
# $ lsdep github.com/foo/bar .
#
# By default, this will list dependencies (imports), test imports, and test
# dependencies (imports made by test imports). You can recurse further by
# setting TESTIMPORTS to an integer greater than one, or to skip test
# dependencies, set TESTIMPORTS to 0 or a negative integer.
: "${TESTIMPORTS:=1}"
lsdep_impl__ () {
    local txtestimps='{{range $v := .TestImports}}{{print . "\n"}}{{end}}'
    local txdeps='{{range $v := .Deps}}{{print . "\n"}}{{end}}'
    {
        go list -f "${txtestimps}${txdeps}" "$@"
        if [[ -n "${TESTIMPORTS}" ]] && [[ "${TESTIMPORTS:-1}" -gt 0 ]]
        then
            go list -f "${txtestimps}" "$@" |
            sort | uniq |
            comm -23 - <(go list std | sort) |
            TESTIMPORTS=$((TESTIMPORTS - 1)) xargs bash -c 'lsdep_impl__ "$@"' "$0"
        fi
    } |
    sort | uniq |
    comm -23 - <(go list std | sort)
}
export -f lsdep_impl__
lsdep_impl__ "$@"
I just want to mention here that go mod why can also help. It cannot display the whole tree, but it can trace a single branch from a child dependency back to its root.
Example:
$ go mod why github.com/childdep
# github.com/childdep
github.com/arepo.git/service
github.com/arepo.git/service.test
github.com/anotherrepo.git/mocks
github.com/childdep
That means 'childdep' is ultimately imported by 'anotherrepo.git/mocks'.
You can try https://github.com/vc60er/deptree:
redis git:(master) go mod graph | deptree -d 3
package: github.com/go-redis/redis/v9
dependence tree:
┌── github.com/cespare/xxhash/v2@v2.1.2
├── github.com/dgryski/go-rendezvous@v0.0.0-20200823014737-9f7001d12a5f
├── github.com/fsnotify/fsnotify@v1.4.9
│ └── golang.org/x/sys@v0.0.0-20191005200804-aed5e4c7ecf9
├── github.com/nxadm/tail@v1.4.8
│ ├── github.com/fsnotify/fsnotify@v1.4.9
│ │ └── golang.org/x/sys@v0.0.0-20191005200804-aed5e4c7ecf9
│ └── gopkg.in/tomb.v1@v1.0.0-20141024135613-dd632973f1e7
├── github.com/onsi/ginkgo@v1.16.5
│ ├── github.com/go-task/slim-sprig@v0.0.0-20210107165309-348f09dbbbc0
│ │ ├── github.com/davecgh/go-spew@v1.1.1
│ │ └── github.com/stretchr/testify@v1.5.1
│ │ └── ...
The above answer still doesn't show me a dependency tree so I've taken the time to write a Python script to do what I need - hopefully that helps other people.
The issue with the above solution (the others proposed like go list) is that it only tells me the top level. They don't "traverse the tree." This is the output I get - which doesn't help any more than what go build gives me.
.../npd/auth/
.../mon/mlog
.../auth/service
This is what I'm trying to get - I know that auth is broken (top) and that breaker is broken (bottom) from go build but I have no idea what's in between - my script below gives me this output.
.../npd/auth/
.../npd/auth/service
.../npd/auth/resource
.../npd/auth/storage
.../npd/middleware
.../npd/metrics/persist
.../npd/kafka
.../vendor-library/src/github.com/Shopify/sarama
.../vendor-library/src/github.com/Shopify/sarama/vz/breaker
My Python script:
import subprocess
import os
folder_locations=['.../go/src','.../vendor-library/src']
def getImports(_cwd):
    # When the commands were combined they overflowed the buffer and I couldn't find a workaround
    cmd1 = ["go", "list", "-f", " {{.ImportPath}}","./..."]
    cmd2 = ["go", "list", "-f", " {{.Imports}}","./..."]
    process = subprocess.Popen(' '.join(cmd1), cwd=_cwd,shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    out1, err = process.communicate()
    process = subprocess.Popen(' '.join(cmd2), cwd=_cwd,shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    out2, err = process.communicate()
    out2clean=str(out2).replace("b'",'').replace('[','').replace(']','').replace("'",'')
    return str(out1).split('\\n'),out2clean.split('\\n')
def getFullPath(rel_path):
    for i in folder_locations:
        if os.path.exists(i+'/'+rel_path):
            return i+'/'+rel_path
    return None
def getNextImports(start,depth):
    depth=depth+1
    indent = '\t'*(depth+1)
    for i,val in enumerate(start.keys()):
        if depth==1:
            print (val)
        out1,out2=getImports(val)
        noDeps=True
        for j in out2[i].split(' '):
            noDeps=False
            _cwd2=getFullPath(j)
            new_tree = {_cwd2:[]}
            not_exists = (not _cwd2 in alltmp)
            if not_exists:
                print(indent+_cwd2)
                start[val].append(new_tree)
                getNextImports(new_tree,depth)
                alltmp.append(_cwd2)
        if noDeps:
            print(indent+'No deps')
_cwd = '/Users/.../npd/auth'
alltmp=[]
start_root={_cwd:[]}
getNextImports(start_root,0)
The dependencies of a Go project form a directed graph. This graph can have many layers, from a handful to hundreds or thousands. Here is a dependency graph of redis. The cascaded tree can be difficult to read because of the many duplicated subtrees. To make the layout easier to view, the tree can be flattened with gomoddeps to fit the width of the screen.
tzhang:~/github.com/redis/go-redis$ go mod graph | gomoddeps
├─ github.com/redis/go-redis/v9
│ └─ dependencies
│ ├─ github.com/bsm/ginkgo/v2@v2.5.0
│ ├─ github.com/bsm/gomega@v1.20.0
│ ├─ github.com/cespare/xxhash/v2@v2.2.0
│ ├─ github.com/davecgh/go-spew@v1.1.1
│ ├─ github.com/dgryski/go-rendezvous@v0.0.0-20200823014737-9f7001d12a5f
│ ├─ github.com/pmezard/go-difflib@v1.0.0
│ ├─ github.com/stretchr/testify@v1.8.1
│ └─ gopkg.in/yaml.v3@v3.0.1
│
├─ github.com/bsm/ginkgo/v2@v2.5.0
│ └─ dependents
│ └─ github.com/redis/go-redis/v9
│
├─ github.com/bsm/gomega@v1.20.0
│ └─ dependents
│ └─ github.com/redis/go-redis/v9
│
├─ github.com/cespare/xxhash/v2@v2.2.0
│ └─ dependents
│ └─ github.com/redis/go-redis/v9
│
├─ github.com/davecgh/go-spew@v1.1.1
│ └─ dependents
│ ├─ github.com/redis/go-redis/v9
│ ├─ github.com/stretchr/testify@v1.8.1
│ ├─ github.com/stretchr/testify@v1.8.0
│ └─ github.com/stretchr/objx@v0.4.0
...

How to capture shell script program output from cucumber/aruba?

I want to capture the output of a program that I run from a Cucumber feature file.
I created a shell script and placed it in /usr/local/bin/ so it is accessible from anywhere in the system.
abc_qa.sh -
arg=$1
if [[ $arg = 1 ]]
then
    echo $(date)
fi
project structure of cucumber -
aruba -
.
├── features
│ ├── support
│ │ └── env.rb
│ └── use_aruba_cucumber.feature
├── Gemfile
Gemfile -
source 'https://rubygems.org'
gem 'aruba', '~> 0.14.2'
env.rb -
require 'aruba/cucumber'
use_aruba_cucumber.feature -
Feature: Cucumber
  Scenario: First Run
    When I run `bash abc_qa.sh 1`
I want to capture the output of abc_qa.sh in Cucumber itself, check with a simple test whether the date is right, and have that test pass.
You can use %x(command) to get the stdout of your command.
You can then use Time.parse to convert "Sat Nov 5 12:04:18 CET 2016" to 2016-11-05 12:04:18 +0100 as Time object, and compare it to Time.now :
require 'time'
time_from_abc_script = Time.parse(%x(bash abc_qa.sh 1))
puts (Time.now-time_from_abc_script).abs < 5 # No more than 5s difference between 2 times
You could use this boolean in any test file you want.
For example :
In features/use_aruba_with_cucumber.feature :
Feature: Cucumber
  Scenario: First Run
    When I run `bash abc_qa.sh 1`
    Then the output should be the current time
and features/step_definitions/time_steps.rb :
require 'time'
Then(/^the output should be the current time$/) do
  time_from_script = Time.parse(last_command_started.output)
  expect(time_from_script).to be_within(5).of(Time.now)
end
