Snakemake conda env parameter is not taken from config.yaml file

I use a conda env that I create manually, not automatically using Snakemake. I do this to keep tighter version control.
Anyway, in my config.yaml I have the following line:
conda_env: '/rst1/2017-0205_illuminaseq/scratch/swo-406/snakemake'
Then, at the start of my Snakefile I read that variable (reading variables from config in your shell part does not seem to work, am I right?):
conda_env = config['conda_env']
Then, in a shell part, I reference said parameter like this:
rule rsem_quantify:
    input:
        os.path.join(fastq_dir, '{sample}_R1_001.fastq.gz'),
        os.path.join(fastq_dir, '{sample}_R2_001.fastq.gz')
    output:
        os.path.join(analyzed_dir, '{sample}.genes.results'),
        os.path.join(analyzed_dir, '{sample}.STAR.genome.bam')
    threads: 8
    shell:
        '''
        #!/bin/bash
        source activate {conda_env}
        rsem-calculate-expression \
            --paired-end \
            {input} \
            {rsem_ref_base} \
            {analyzed_dir}/{wildcards.sample} \
            --strandedness reverse \
            --num-threads {threads} \
            --star \
            --star-gzipped-read-file \
            --star-output-genome-bam
        '''
Notice the {conda_env}. Now this gives me the following error:
Could not find conda environment: None
You can list all discoverable environments with `conda info --envs`.
Now, if I replace {conda_env} with its value directly, /rst1/2017-0205_illuminaseq/scratch/swo-406/snakemake, it does work! I don't have any trouble reading other parameters this way (like rsem_ref_base and analyzed_dir in the example rule above).
What could be wrong here?
Highest regards,
Freek.

The pattern I use is to load variables into params, so something along the lines of:
rule rsem_quantify:
    input:
        os.path.join(fastq_dir, '{sample}_R1_001.fastq.gz'),
        os.path.join(fastq_dir, '{sample}_R2_001.fastq.gz')
    output:
        os.path.join(analyzed_dir, '{sample}.genes.results'),
        os.path.join(analyzed_dir, '{sample}.STAR.genome.bam')
    params:
        conda_env=config['conda_env']
    threads: 8
    shell:
        '''
        #!/bin/bash
        source activate {params.conda_env}
        rsem-calculate-expression \
            ...
        '''
Although, I'd also never do this with a conda environment, because Snakemake has conda environment management built-in. See this section in the docs on Integrated Package Management for details. This makes reproducibility much more manageable.
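For reference, a minimal sketch of that built-in route (the environment file name envs/rsem.yaml and the pinned versions are my own assumptions, not from the question): the rule would declare the environment with a conda: directive instead of calling source activate, and Snakemake then builds and activates it itself:
# envs/rsem.yaml (hypothetical) would pin the tools, e.g.:
#   channels: [bioconda, conda-forge]
#   dependencies: [rsem=1.3.3, star=2.7.10a]
# The rule carries `conda: "envs/rsem.yaml"` in place of `source activate`,
# and the workflow is run with the conda integration switched on:
snakemake --use-conda --cores 8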

Related

Helm: Expanding environment variables in values.yaml with helm install or upgrade

In our Jenkins pipeline, I'm using a bash script to call the helm install command. We have a values.yaml containing most of the values to be passed to Helm. However, a few values are based on environment variables and have to be passed using the --set argument. Here is the snippet:
helm install $RELEASE_NAME shared/phoenixmsp-app -f value.yaml \
    --set global.env.production=$production \
    --set global.cluster.hosts=${CONFIG[${CLUSTER_NAME}]} \
    --set nameOverride=$RELEASE_NAME \
    --set fullnameOverride=$RELEASE_NAME \
    --set image.repository=myhelm.hub.mycloud.io/myrepo/mainservice \
    --set-string image.tag=$DOCKER_TAG \
    --wait --timeout 180s --namespace $APP_NAMESPACE
We want to move these --set parameters to values.yaml. The goal is to get rid of --set and simply pass the values.yaml.
Question: Is it possible to expand Environment Variables in values.yaml while calling with helm install or helm upgrade?
The only way I think you can do that, if you really want to use a single YAML file, is to have a template values.yaml and either sed the values into it or use a templating language like Jinja or Mustache, then feed the resulting output into Helm.
--set is a good solution here, but if you really don't want that, dynamically write a second values file for the run-time values.
echo "
global:
env:
production: $production
cluster:
hosts: ${CONFIG[${CLUSTER_NAME}]}
nameOverride: $RELEASE_NAME
fullnameOverride: $RELEASE_NAME
image:
repository: myhelm.hub.mycloud.io/myrepo/mainservice
tag: $DOCKER_TAG
" > runtime.yaml
helm install $RELEASE_NAME shared/phoenixmsp-app -f value.yaml -f runtime.yaml \
--wait --timeout 180s --namespace $APP_NAMESPACE
This really does nothing except slightly lower the precedence compared to --set, though.
If all these values are known ahead of time, maybe build your runtime.yaml in advance and throw it into a git repo people can peer-review before deployment time, and just use the variables to select the right file from the repo.
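For the template-plus-substitution route mentioned above, a minimal sketch using envsubst (the file names values.tmpl.yaml and rendered-values.yaml, and the CLUSTER_HOSTS helper variable, are my own, not from the question):
# Note: envsubst only expands plain, exported variables, so the bash array lookup
# ${CONFIG[${CLUSTER_NAME}]} would first need to be exported as a simple variable,
# and the template would reference $CLUSTER_HOSTS instead.
export CLUSTER_HOSTS="${CONFIG[${CLUSTER_NAME}]}"
envsubst < values.tmpl.yaml > rendered-values.yaml
helm install $RELEASE_NAME shared/phoenixmsp-app -f value.yaml -f rendered-values.yaml \
    --wait --timeout 180s --namespace $APP_NAMESPACE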

How to set variables using terragrunt before_hook

I need to use some gcloud commands in order to create a Redis instance on GCP as terraform does not support some options that I need.
I'm trying this:
terraform {
  # Before apply, run script.
  before_hook "create_redis_script" {
    commands = ["apply"]
    execute = [
      "REDIS_REGION=${local.module_vars.redis_region}",
      "REDIS_PROJECT=${local.module_vars.redis_project}",
      "REDIS_VPC=${local.module_vars.redis_vpc}",
      "REDIS_PREFIX_LENGHT=${local.module_vars.redis_prefix_lenght}",
      "REDIS_RESERVED_RANGE_NAME=${local.module_vars.redis_reserved_range_name}",
      "REDIS_RANGE_DESCRIPTION=${local.module_vars.redis_range_description}",
      "REDIS_NAME=${local.module_vars.redis_name}",
      "REDIS_SIZE=${local.module_vars.redis_size}",
      "REDIS_ZONE=${local.module_vars.redis_zone}",
      "REDIS_ALT_ZONE=${local.module_vars.redis_alt_zone}",
      "REDIS_VERSION=${local.module_vars.redis_version}",
      "bash",
      "../../../scripts/create-redis-instance.sh"
    ]
  }
}
The script is like this:
echo "[+]Creating IP Allocation Automatically using <$REDIS_VPC-network\/$REDIS_PREFIX_LENGHT>"
gcloud compute addresses create $REDIS_RESERVED_RANGE_NAME \
--global \
--purpose=VPC_PEERING \
--prefix-lenght=$REDIS_PREFIX_LENGHT \
--description=$REDIS_RANGE_DESCRIPTION \
--network=$REDIS_VPC
The error I get is:
terragrunt apply
5b35d0bf15d0a0d61b303ed32556b85417e2317f
5b35d0bf15d0a0d61b303ed32556b85417e2317f
5b35d0bf15d0a0d61b303ed32556b85417e2317f
ERRO[0002] Hit multiple errors:
Hit multiple errors:
exec: "REDIS_REGION=us-east1": executable file not found in $PATH
ERRO[0002] Unable to determine underlying exit code, so Terragrunt will exit with error code 1
I encountered the same issue and resigned myself to passing the values as parameters instead of environment variables.
It involves modifying the script and is a far less clear declaration, but it works :|
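A rough sketch of that workaround, assuming the hook is changed to pass the values as arguments (e.g. execute = ["bash", "../../../scripts/create-redis-instance.sh", local.module_vars.redis_reserved_range_name, local.module_vars.redis_vpc, ...]) and the script reads them positionally; the argument order here is my own choice:
#!/usr/bin/env bash
# create-redis-instance.sh (sketch): take the values as positional arguments
# instead of environment variables.
RESERVED_RANGE_NAME=$1
VPC=$2
PREFIX_LENGTH=$3
RANGE_DESCRIPTION=$4

echo "[+]Creating IP Allocation Automatically using <$VPC-network/$PREFIX_LENGTH>"
gcloud compute addresses create "$RESERVED_RANGE_NAME" \
    --global \
    --purpose=VPC_PEERING \
    --prefix-length="$PREFIX_LENGTH" \
    --description="$RANGE_DESCRIPTION" \
    --network="$VPC"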

Run ansible-lint through subdirectories within a gitlab role

I am trying to add a validation step to a gitlab repo holding a single ansible role (with no playbook).
The structure of the role looks like:
.gitlab-ci.yml
tasks/
templates/
files/
vars/
handlers/
With the gitlab-ci looking like:
stages:
  - lint

job-lint:
  image:
    name: cytopia/ansible-lint:latest
    entrypoint: ["/bin/sh", "-c"]
  stage: lint
  script:
    - ansible-lint --version
    - ansible-lint . -x 106 tasks/*.yml
I need to skip the naming rule, thus ignoring rule 106.
Otherwise, I would like all files in the repo root to be checked. Since there is no playbook, lint has to be given the files that need to be checked... or at least, that is what I understood: I may have this point wrong. But anyway, if I give no name, lint returns OK but actually performs no check.
My problem is that I don't know how to tell it to check all the YAML files recursively, or even within a subdirectory. The above code returns an error:
ansible-lint: error: unrecognized arguments: tasks/deploy.yml tasks/localhost.yml tasks/main.yml tasks/managedata.yml tasks/psqlconf.yml
Any idea on how to check all the files from a subdirectory or through the whole role?
PS : I am using cytopia image for ansible-lint, but I have no problem using another, provided it's hosted on dockerhub.
You should certainly be able to pass multiple YAML files as arguments to ansible-lint. I have version 4.1.1a0, and I'm able to use it like this, for example:
ansible-lint -x 106 roles/*/tasks/*.yml
I notice that you seem to have placed a . before your -x 106; that looks like an error. It doesn't look like ansible-lint will accept a directory name as an argument (it doesn't cause it to fail; it just doesn't accomplish anything).
I've tried this both with a locally installed ansible-lint and using the cytopia/ansible-lint image, which appears to perform identically:
docker run --rm -v $PWD:/src -w /src cytopia/ansible-lint -x 106 roles/*/tasks/*.yml
If you want to check all the YAML files, you can use find and pipe the results to xargs, something like this:
find ./ -not -name ".gitlab-ci.yml" -name "*.yml" | xargs ansible-lint -x 106
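If any of the file names could contain spaces, a null-delimited variant of the same idea is safer:
find . -name '*.yml' -not -name '.gitlab-ci.yml' -print0 | xargs -0 ansible-lint -x 106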
However, ansible-lint -x 106 ./ should work; are you sure that your role really has errors? I've tested it both on ansible-galaxy init generated roles (with meta and all that stuff) and roles containing only a tasks directory, and it worked every time.
EDIT: I tried creating an error in existing role, replacing "present" with "latest" in package install task
$ ansible-galaxy install geerlingguy.nfs
$ cd ~/.ansible/roles/geerlingguy.nfs
$ sed -i "s/present/latest/g" tasks/setup-RedHat.yml
$ ansible-lint ./
Examining tasks/main.yml of type tasks
Examining tasks/setup-Debian.yml of type tasks
Examining tasks/setup-RedHat.yml of type tasks
Examining handlers/main.yml of type handlers
Examining meta/main.yml of type meta
[403] Package installs should not use latest
tasks/setup-RedHat.yml:2
Task/Handler: Ensure NFS utilities are installed.
and it actually worked, so you may want to run with verbose output to check whether it actually works; maybe the rules applied to individual YAML files are different from those applied to whole roles.
When I ran my find-based check I got a lot of extra [204] Lines should be no longer than 160 chars warnings.

Problem with Snakemake submitting jobs with multiple wildcards on SGE

I used Snakemake on an LSF cluster before and everything worked just fine. However, I recently migrated to an SGE cluster and I am getting a very strange error when I try to run a job with more than one wildcard.
When I try to submit a job based on this rule
rule download_reads:
    threads: 1
    output: "data/{sp}/raw_reads/{accesion}_1.fastq.gz"
    shell: "scripts/download_reads.sh {wildcards.sp} {wildcards.accesion} data/{wildcards.sp}/raw_reads/{wildcards.accesion}"
I get the following error (snakemake_clust.sh details below):
./snakemake_clust.sh data/Ecol1/raw_reads/SRA123456_1.fastq.gz
Building DAG of jobs...
Using shell: /bin/bash
Provided cluster nodes: 10
Job counts:
count jobs
1 download_reads
1
[Thu Jul 30 12:08:57 2020]
rule download_reads:
output: data/Ecol1/raw_reads/SRA123456_1.fastq.gz
jobid: 0
wildcards: sp=Ecol1, accesion=SRA123456
scripts/download_reads.sh Ecol1 SRA123456 data/Ecol1/raw_reads/SRA123456
Unable to run job: ERROR! two files are specified for the same host
ERROR! two files are specified for the same host
Exiting.
Error submitting jobscript (exit code 1):
Shutting down, this might take some time.
When I replace the sp wildcard with a constant, it works as expected:
rule download_reads:
    threads: 1
    output: "data/Ecol1/raw_reads/{accesion}_1.fastq.gz"
    shell: "scripts/download_reads.sh Ecol1 {wildcards.accesion} data/Ecol1/raw_reads/{wildcards.accesion}"
I.e. I get
Submitted job 1 with external jobid 'Your job 50731 ("download_reads") has been submitted'.
I wonder why I might have this problem; I am sure I used exactly the same rule on the LSF-based cluster before without any problem.
some details
The snakemake submitting script looks like this
#!/usr/bin/env bash
mkdir -p logs
snakemake "$@" -p --jobs 10 --latency-wait 120 --cluster "qsub \
    -N {rule} \
    -pe smp64 {threads} \
    -cwd \
    -b y \
    -o \"logs/{rule}.{wildcards}.out\" \
    -e \"logs/{rule}.{wildcards}.err\""
-b y makes the command execute as-is, and -cwd sets the working directory on the compute node to the directory from which the job was submitted. The other flags / specifications are clear, I hope.
Also, I am aware of the --drmaa flag, but I think our cluster is not well configured for that. --cluster has so far been the more robust solution.
-- edit 1 --
When I execute exactly the same Snakefile locally (on the frontend, without the --cluster flag), the script gets executed as expected. It seems to be a problem with the interaction between Snakemake and the scheduler.
-o \"logs/{rule}.{wildcards}.out\" \
-e \"logs/{rule}.{wildcards}.err\""
This is a random guess... Multiple wildcards are concatenated with a space before being substituted into logs/{rule}.{wildcards}.err. So even though you use double quotes, SGE treats the resulting string as two files and throws the error. What if you use single quotes instead? Like:
-o 'logs/{rule}.{wildcards}.out' \
-e 'logs/{rule}.{wildcards}.err'
Alternatively, you could concatenate the wildcards in the rule and use the result on the command line. E.g.:
rule one:
    params:
        wc=lambda wc: '_'.join(wc)
    output: ...
Then use:
-o 'logs/{rule}.{params.wc}.out' \
-e 'logs/{rule}.{params.wc}.err'
(This second solution, if it works, kind of sucks though)
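For completeness, applying the single-quote suggestion to the submit script from the question would look roughly like this (untested; queue, flags, and paths kept as in the original):
#!/usr/bin/env bash
mkdir -p logs
snakemake "$@" -p --jobs 10 --latency-wait 120 --cluster "qsub \
    -N {rule} \
    -pe smp64 {threads} \
    -cwd \
    -b y \
    -o 'logs/{rule}.{wildcards}.out' \
    -e 'logs/{rule}.{wildcards}.err'"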

Passing a string containing curly brackets to helm caused: Error: failed parsing --set data: key

${CLIENT_ID} and ${CLIENT_SECRET} are both sourced from a YAML-based properties file, e.g.:
CLIENT_ID: 11111111-1111-1111-1111-111111111111
CLIENT_SECRET: '{aes}AM+JYP8t9ga1m6s111x1fjdfePL10v90RmbgWFdOjVdD/wlnszAbJad8aOI4qqMv6eSGaW2nfTF4PG2OYH+rx9K052TXNP6PGAAcRph9pl11'
using:
PROPERTIES_FILE="properties.yaml"
CLIENT_ID=$(yq r "${PROPERTIES_FILE}" CLIENT_ID)
CLIENT_SECRET=$(yq r "${PROPERTIES_FILE}" CLIENT_SECRET)
and then passed into my helm command for deploying my app:
echo ${CLIENT_ID}
# 11111111-1111-1111-1111-111111111111
echo ${CLIENT_SECRET}
# {aes}AM+JYP8t9ga1m6s111x1fjdfePL10v90RmbgWFdOjVdD/wlnszAbJad8aOI4qqMv6eSGaW2nfTF4PG2OYH+rx9K052TXNP6PGAAcRph9pl11
helm upgrade -i --debug --namespace mynamespace release \
-f "charts/app/values.yaml" \
--set "app.configmap.dependancy.client_id=${CLIENT_ID}" \
--set "app.configmap.dependancy.client_secret=${CLIENT_SECRET}" \
"charts/app/"
charts/app/values.yaml contains:
app:
  ..
  configmap:
    dependancy:
      client_id: ""
      client_secret: ""
The problem is, I get this error when running the helm command:
Error: failed parsing --set-string data: key "AM+JYP8t9ga1m6s111x1fjdfePL10v90RmbgWFdOjVdD/wlnszAbJad8aOI4qqMv6eSGaW2nfTF4PG2OYH+rx9K052TXNP6PGAAcRph9pl11" has no value
No resources found.
Any idea why the prefixed {aes} is causing problems when passed into helm like this? The command works if I remove the {aes} prefix.
Helm tries to make it possible to pass some structured data in using --set and that's tripping you up here. In particular,
Lists can be expressed by enclosing values in { and }. For example, --set name={a, b, c} translates to [a list of a, b, and c].
So if you --set 'key={aes}AM+JYP8...', the {aes} part looks like that list syntax, but then there are things after it that Helm doesn't understand.
You can backslash-escape the curly braces, though that's a little tricky to do in shell syntax. There's a --set-string option (documented in the Helm 2 documentation but still present in Helm 3) that might be able to do this. Possibly the most straightforward path, though, is to write out your own custom YAML values file:
#!/bin/sh
PROPERTIES_FILE="properties.yaml"
CLIENT_ID=$(yq r "${PROPERTIES_FILE}" CLIENT_ID)
CLIENT_SECRET=$(yq r "${PROPERTIES_FILE}" CLIENT_SECRET)
cat >values.tmp.yaml <<EOF
app:
  configmap:
    dependancy:
      client_id: "${CLIENT_ID}"
      client_secret: "${CLIENT_SECRET}"
EOF
helm upgrade -i --debug --namespace mynamespace release \
-f values.tmp.yaml charts/app
(You can have multiple helm install -f options if you need to. The chart's values.yaml is read in automatically, and is overridden by command-line options.)
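As a quick sanity check before the real upgrade, you could render the chart locally and confirm the secret comes through intact (Helm 3 syntax; the grep pattern assumes the chart writes the value into a manifest field mentioning client_secret):
helm template release charts/app -f values.tmp.yaml | grep -i client_secret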
