Trying to get all paths in a YAML file - yaml

I've got an input YAML file (test.yml) as follows:
# sample set of lines
foo:
  x: 12
  y: hello world
  ip_range['initial']: 1.2.3.4
  ip_range[]: tba
  array['first']: Cluster1
array2[]: bar
The source contains square brackets for some keys (possibly empty).
I'm trying to get a line by line list of all the paths in the file, ideally like:
foo.x: 12
foo.y: hello world
foo.ip_range['initial']: 1.2.3.4
foo.ip_range[]: tba
foo.array['first']: Cluster1
array2[]: bar
I've used the yamlpath library and its yaml-paths CLI, but can't get the desired output. Trying this:
yaml-paths -m -s =foo -K test.yml
outputs:
foo.x
foo.y
foo.ip_range\[\'initial\'\]
foo.ip_range\[\]
foo.array\[\'first\'\]
Each path is on one line, but the output has all the escape characters ( \ ). Modifying the call to remove the -m option ("expand matching parent nodes") fixes that problem but the output is then not one path per line:
yaml-paths -s =foo -K test.yml
gives:
foo: {"x": 12, "y": "hello world", "ip_range['initial']": "1.2.3.4", "ip_range[]": "tba", "array['first']": "Cluster1"}
Any ideas how I can get the one line per path entry but without the escape chars? I was wondering if there is anything for path querying in the ruamel modules?

Your "paths" are nothing more than the joined string representation of the keys (and probably indices) of the
mappings (and potentially sequences) in your YAML document.
That can be trivially generated from data loaded from YAML with a recursive function:
import sys
import ruamel.yaml
yaml_str = """\
# sample set of lines
foo:
  x: 12
  y: hello world
  ip_range['initial']: 1.2.3.4
  ip_range[]: tba
  array['first']: Cluster1
array2[]: bar
"""
def pathify(d, p=None, paths=None, joinchar='.'):
    # first call: create the result dict and start the recursion with an empty path
    if p is None:
        paths = {}
        pathify(d, "", paths, joinchar=joinchar)
        return paths
    pn = p
    if p != "":
        pn += joinchar
    if isinstance(d, dict):
        for k in d:
            v = d[k]
            pathify(v, pn + k, paths, joinchar=joinchar)
    elif isinstance(d, list):
        for idx, e in enumerate(d):
            pathify(e, pn + str(idx), paths, joinchar=joinchar)
    else:
        # leaf node: record the value under the accumulated path
        paths[p] = d
yaml = ruamel.yaml.YAML(typ='safe')
paths = pathify(yaml.load(yaml_str))
for p, v in paths.items():
    print(f'{p} -> {v}')
which gives:
foo.x -> 12
foo.y -> hello world
foo.ip_range['initial'] -> 1.2.3.4
foo.ip_range[] -> tba
foo.array['first'] -> Cluster1
array2[] -> bar
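If the exact "path: value" layout from the question is wanted, the same recursion can also be written as a generator. This is only a minimal sketch: the name iter_paths is hypothetical, and the data is written out as a plain dict (roughly what a safe load of the sample should yield) so that it runs without any YAML library.

```python
def iter_paths(node, prefix="", joinchar="."):
    """Yield (path, value) pairs for every leaf in nested dicts and lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}{joinchar}{key}" if prefix else str(key)
            yield from iter_paths(value, child, joinchar)
    elif isinstance(node, list):
        for idx, item in enumerate(node):
            child = f"{prefix}{joinchar}{idx}" if prefix else str(idx)
            yield from iter_paths(item, child, joinchar)
    else:
        yield prefix, node

# Assumed stand-in for the loaded sample document
data = {
    "foo": {
        "x": 12,
        "y": "hello world",
        "ip_range['initial']": "1.2.3.4",
        "ip_range[]": "tba",
        "array['first']": "Cluster1",
    },
    "array2[]": "bar",
}

lines = [f"{path}: {value}" for path, value in iter_paths(data)]
for line in lines:
    print(line)
```

Because the keys are joined as plain strings, nothing gets escaped; the price is that such paths cannot be parsed back unambiguously if a key itself contains the join character.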

While Anthon's answer certainly produces the output you were after, I think your question was specifically about how to get the yaml-paths command to produce the desired output. I'll address that original question.
As of version 3.5.0, the yamlpath project's yaml-paths command supports a --noescape option which removes the escape symbols from output. Using your input file and the new option, you may find this output more to your liking:
$ yaml-paths --nofile --expand --keynames --noescape --values --search='=~/.*/' test.yml
foo.x: 12
foo.y: hello world
foo.ip_range['initial']: 1.2.3.4
foo.ip_range[]: tba
foo.array['first']: Cluster1
array2[]: bar
Note:
Using the --values option includes the value with each YAML Path.
For interest, I changed the --search expression to match every node in the input file rather than only the "foo" data.
The default output (without setting --noescape) produces YAML Paths which can be used as direct input into other YAML Path parsers and processors; setting --noescape changes this to render human-friendly paths which may not work as downstream YAML Path input.
Disclaimer: I am the author of the yamlpath project. Should you ever run into issues or have questions about it, please visit the project's GitHub project site and engage me via Issues (bugs and feature requests) or Discussions (questions). Thank you!

Related

Snakemake - parameter file treated as a wildcard

I have written a pipeline in Snakemake. It's an ATAC-seq pipeline (bioinformatics pipeline to analyze genomics data from a specific experiment). Basically, until merging alignment step I use {sample_id} wildcard, to later switch to {sample} wildcard (merging two or more sample_ids into one sample).
(Working DAG here; for simplicity only one sample is shown: the orange and blue {sample_id}s are merged into one green {sample}.)
The all rule looks as follows:
configfile: "config.yaml"
SAMPLES_DICT = dict()
with open(config['SAMPLE_SHEET'], "r+") as fil:
    next(fil)
    for lin in fil.readlines():
        row = lin.strip("\n").split("\t")
        sample_id = row[0]
        sample_name = row[1]
        if sample_name in SAMPLES_DICT.keys():
            SAMPLES_DICT[sample_name].append(sample_id)
        else:
            SAMPLES_DICT[sample_name] = [sample_id]
SAMPLES = list(SAMPLES_DICT.keys())
SAMPLE_IDS = [sample_id for sample in SAMPLES_DICT.values() for sample_id in sample]
rule all:
    input:
        # FASTQC output for RAW reads
        expand(os.path.join(config['FASTQC'], '{sample_id}_R{read}_fastqc.zip'),
               sample_id = SAMPLE_IDS,
               read = ['1', '2']),
        # Trimming
        expand(os.path.join(config['TRIMMED'],
                            '{sample_id}_R{read}_val_{read}.fq.gz'),
               sample_id = SAMPLE_IDS,
               read = ['1', '2']),
        # Alignment
        expand(os.path.join(config['ALIGNMENT'], '{sample_id}_sorted.bam'),
               sample_id = SAMPLE_IDS),
        # Merging
        expand(os.path.join(config['ALIGNMENT'], '{sample}_sorted_merged.bam'),
               sample = SAMPLES),
        # Marking Duplicates
        expand(os.path.join(config['ALIGNMENT'], '{sample}_sorted_md.bam'),
               sample = SAMPLES),
        # Filtering
        expand(os.path.join(config['FILTERED'],
                            '{sample}.bam'),
               sample = SAMPLES),
        expand(os.path.join(config['FILTERED'],
                            '{sample}.bam.bai'),
               sample = SAMPLES),
        # multiqc report
        "multiqc_report.html"
    message:
        '\n#################### ATAC-seq pipeline #####################\n'
        'Running all necessary rules to produce complete output.\n'
        '############################################################'
I know it's too messy, I should only leave the necessary bits, but here my understanding of snakemake fails cause I don't know what I have to keep and what I should delete.
This is working, to my knowledge exactly as I want.
However, I added a rule:
rule hmmratac:
    input:
        bam = os.path.join(config['FILTERED'], '{sample}.bam'),
        index = os.path.join(config['FILTERED'], '{sample}.bam.bai')
    output:
        model = os.path.join(config['HMMRATAC'], '{sample}.model'),
        gappedPeak = os.path.join(config['HMMRATAC'], '{sample}_peaks.gappedPeak'),
        summits = os.path.join(config['HMMRATAC'], '{sample}_summits.bed'),
        states = os.path.join(config['HMMRATAC'], '{sample}.bedgraph'),
        logs = os.path.join(config['HMMRATAC'], '{sample}.log'),
        sample_name = '{sample}'
    log:
        os.path.join(config['LOGS'], 'hmmratac', '{sample}.log')
    params:
        genomes = config['GENOMES'],
        blacklisted = config['BLACKLIST']
    resources:
        mem_mb = 32000
    message:
        '\n######################### Peak calling ########################\n'
        'Peak calling for {output.sample_name}\n.'
        '############################################################'
    shell:
        'HMMRATAC -Xms2g -Xmx{resources.mem_mb}m '
        '--bam {input.bam} --index {input.index} '
        '--genome {params.genome} --blacklist {params.blacklisted} '
        '--output {output.sample_name} --bedgraph true &> {log}'
And into the rule all, after filtering, before multiqc, I added:
# Peak calling
expand(os.path.join(config['HMMRATAC'], '{sample}.model'),
sample = SAMPLES),
Relevant config.yaml fragments:
# Path to blacklisted regions
BLACKLIST: "/mnt/data/.../hg38.blacklist.bed"
# Path to chromosome sizes
GENOMES: "/mnt/data/.../hg38_sizes.genome"
# Path to filtered alignment
FILTERED: "alignment/filtered"
# Path to peaks
HMMRATAC: "peaks/hmmratac"
This is the error* I get (It goes on for every input and output of the rule). *Technically it's a warning but it halts execution of snakemake so I am calling it an error.
File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
WARNING:snakemake.logging:File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
The path doesn't actually contain "..."; I just didn't feel safe providing an absolute path here.
I have struggled with this error for a couple of days. I have looked through the documentation and listened to the introduction. I understand that the above description is far from perfect (it is huge because I don't even know how to whittle it down to a minimal reproducible example), but I am desperate and hope you can be patient with me.
Any suggestions as to how to google it, where to look for an error would be much appreciated.
"Technically it's a warning but it halts execution of snakemake so I am calling it an error."
It would be useful to post the logs from snakemake to see if snakemake terminated with an error and if so what error.
However, in addition to Eric C.'s suggestion to use wildcards.sample instead of {sample} as file name, I think that this is quite suspicious:
alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam
/mnt/ is usually at the root of the file system and you are prepending to it a relative path (alignment/filtered). Are you sure it is correct?
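One possible explanation for how such a path arises: by default a Snakemake wildcard matches any non-empty string (the regular expression .+), slashes included, so an output declared as the bare pattern '{sample}' can match practically any file path. The sketch below imitates that matching in plain Python. It is a simplified illustration, not Snakemake's actual implementation, and the helper names and the example path /data/ref/blacklist.bed are made up.

```python
import re

def pattern_to_regex(pattern):
    """Compile 'dir/{name}.ext' so that each {name} matches '.+' (slashes included)."""
    pieces = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, piece in enumerate(pieces):
        # even positions hold literal text, odd positions hold wildcard names
        regex += re.escape(piece) if i % 2 == 0 else f"(?P<{piece}>.+)"
    return re.compile(regex)

def match_wildcards(pattern, path):
    """Return the wildcard values if path matches pattern completely, else None."""
    m = pattern_to_regex(pattern).fullmatch(path)
    return m.groupdict() if m else None

# A bare '{sample}' output matches an unrelated absolute path...
bare = match_wildcards("{sample}", "/data/ref/blacklist.bed")
# ...whereas an output with a directory and a suffix does not.
anchored = match_wildcards("peaks/hmmratac/{sample}.model", "/data/ref/blacklist.bed")

# Filling the matched value into the rule's input pattern gives the double slash.
requested_input = "alignment/filtered/{sample}.bam".format(**bare)
print(bare, anchored, requested_input)
```

Under that reading, dropping the bare '{sample}' output (and passing the sample name via wildcards.sample or a params entry instead) would stop the rule from claiming unrelated files.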

Snakemake, how to change output filename when using wildcards

I think I have a simple problem but I don't know how to solve it.
My input folder contains files like this:
AAAAA_S1_R1_001.fastq
AAAAA_S1_R2_001.fastq
BBBBB_S2_R1_001.fastq
BBBBB_S2_R2_001.fastq
My snakemake code:
import glob
samples = [os.path.basename(x) for x in sorted(glob.glob("input/*.fastq"))]
name = []
for x in samples:
    if "_R1_" in x:
        name.append(x.split("_R1_")[0])
NAME = name
rule all:
    input:
        expand("output/{sp}_mapped.bam", sp=NAME),
rule bwa:
    input:
        R1 = "input/{sample}_R1_001.fastq",
        R2 = "input/{sample}_R2_001.fastq"
    output:
        mapped = "output/{sample}_mapped.bam"
    params:
        ref = "refs/AF086833.fa"
    run:
        shell("bwa mem {params.ref} {input.R1} {input.R2} | samtools sort > {output.mapped}")
The output file names are:
AAAAA_S1_mapped.bam
BBBBB_S2_mapped.bam
I want the output file to be:
AAAAA_mapped.bam
BBBBB_mapped.bam
How can I change the output name, or rename the files before or after the bwa rule?
Try this:
import pathlib
indir = pathlib.Path("input")
paths = indir.glob("*_S?_R?_001.fastq")
samples = set([x.stem.split("_")[0] for x in paths])
rule all:
    input:
        expand("output/{sample}_mapped.bam", sample=samples)
def find_fastqs(wildcards):
    fastqs = [str(x) for x in indir.glob(f"{wildcards.sample}_*.fastq")]
    return sorted(fastqs)
rule bwa:
    input:
        fastqs = find_fastqs
    output:
        mapped = "output/{sample}_mapped.bam"
    params:
        ref = "refs/AF086833.fa"
    shell:
        "bwa mem {params.ref} {input.fastqs} | samtools sort > {output.mapped}"
Uses an input function to find the correct samples for rule bwa. There might be a more elegant solution, but I can't see it right now. I think this should work, though.
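The name handling in this answer can be tried outside Snakemake. The following is a rough sketch in plain Python, assuming (as the answer does) that the sample name is everything before the first underscore; sample_of and group_fastqs are hypothetical helper names, and the file list is typed in rather than globbed.

```python
from pathlib import PurePosixPath

def sample_of(filename):
    """Sample name = text before the first underscore of the file name."""
    return PurePosixPath(filename).name.split("_")[0]

def group_fastqs(filenames):
    """Map each sample name to its sorted list of FASTQ files."""
    groups = {}
    for filename in sorted(filenames):
        groups.setdefault(sample_of(filename), []).append(filename)
    return groups

files = [
    "input/BBBBB_S2_R2_001.fastq",
    "input/AAAAA_S1_R1_001.fastq",
    "input/BBBBB_S2_R1_001.fastq",
    "input/AAAAA_S1_R2_001.fastq",
]
groups = group_fastqs(files)
for sample, fastqs in groups.items():
    print(f"output/{sample}_mapped.bam <- {fastqs}")
```

Sorting keeps R1 ahead of R2 for each sample, which is the order in which the reads are handed to bwa mem. Sample names that themselves contain an underscore would need a different split rule.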
(Edited to reflect OP's edit.)
Unfortunately, I've also had this problem with filenames with the following logic: {batch}/{seq_run}_{index}_{flowcell}_{lane}_{read_orientation}.fastq.gz.
I think that the core problem is that none of the individual wildcards are unique. Also, not all values for all wildcards can be combined; seq_run1 was run on lane1, not lane2. Therefore, expand() does not work.
After multiple attempts in Snakemake (see below), my solution was to standardize input with mv / sed / rename. Removing {batch}, {flowcell} and {lane} made it possible to use {sample}, a unique combination of {seq_run} and {index}.
What did not work (but might be worth trying for others in the same situation):
Adding the zip argument to expand()
Renaming output using the following syntax:
output: "_".join(re.split("[/_]", "{full_filename}")[1,2]+".fastq.gz"
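The point about wildcard values that cannot be freely combined can be illustrated in plain Python. By default Snakemake's expand() builds every combination of the supplied values, while passing zip pairs them up position by position. The sketch below mimics the two behaviours with a hypothetical helper called fill; it is not Snakemake's own code, and the run and lane names are invented.

```python
from itertools import product

def fill(pattern, combine, **values):
    """Format pattern once per tuple produced by combine(*value_lists)."""
    names = list(values)
    combos = combine(*(values[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]

pattern = "{seq_run}_{lane}.fastq.gz"
runs = ["run1", "run2"]
lanes = ["lane1", "lane2"]

all_combinations = fill(pattern, product, seq_run=runs, lane=lanes)
paired = fill(pattern, zip, seq_run=runs, lane=lanes)
print(all_combinations)
print(paired)
```

With product, files such as run1_lane2.fastq.gz are requested even though they never existed; with zip only the pairs that belong together are produced. Whether that is enough depends on the naming scheme, and as noted above it did not solve this particular case.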

mapping reads using snakemake

I am trying to run hisat2 mapping using Snakemake.
Basically, I'm using a config.yaml file like this:
reads:
  set1: /path/to/set1/samplelist.tab
hisat2:
  database: genome
  genome: genome.fa
  nodes: 2
  memory: 8G
  arguments: --dta
executables:
  hisat2: /Tools/hisat2-2.1.0/hisat2
  samtools: /Tools/samtools-1.3/samtools
Then Snakefile:
configfile: "config.yaml"
workdir: "/path/to/working_dir/"
# Hisat2
rule hisat2:
    input:
        reads = lambda wildcards: config["reads"][wildcards.sample]
    output:
        bam = "{sample}/{sample}.bam"
    params:
        idx=config["hisat2"]["database"],
        executable = config["executables"]["hisat2"],
        nodes = config["hisat2"]["nodes"],
        memory = config["hisat2"]["memory"],
        executable2 = config["executables"]["samtools"]
    run:
        shell("{params.executable} --dta -p {params.nodes} -x {params.idx} {input.reads} |"
              "{params.executable2} view -Sbh -o {output.bam} -")
# all
rule all:
    input:
        lambda wildcards: [sample + "/" + sample + ".bam"
                           for sample in config["reads"].keys()]
My samplelist.tab is like this:
id reads1 reads2
set1a set1a_R1.fastq.gz set1a_R2.fastq.gz
set1b set1b_R1.fastq.gz set1b_R2.fastq.gz
Any hints on how to get this working? I apologize for the messy script; I have just started using Snakemake.
You will have to do something like this:
import pandas as pd
reads = pd.read_csv(config["reads"]['set1'], sep='\t', index_col=0)
def get_fastq(wildcards):
    return list(reads.loc[wildcards.sample].values)
rule hisat2:
    input:
        get_fastq
    ...
First you will need to load the samplelist and store this (I did it as a pandas dataframe). Then you can lookup which files belong to that sample name.
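The lookup step can be checked on its own, without Snakemake or pandas. The sketch below is an illustration only: it assumes the sample list is tab-separated with the header columns id, reads1 and reads2 as shown in the question, reads it from an in-memory string instead of the configured file, and uses the hypothetical names load_sheet and get_fastq_for.

```python
import csv
import io

# In-memory copy of the sample list from the question (assumed tab-separated)
SHEET = (
    "id\treads1\treads2\n"
    "set1a\tset1a_R1.fastq.gz\tset1a_R2.fastq.gz\n"
    "set1b\tset1b_R1.fastq.gz\tset1b_R2.fastq.gz\n"
)

def load_sheet(handle):
    """Return {sample id: [reads1, reads2]} from a tab-separated sample list."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["id"]: [row["reads1"], row["reads2"]] for row in reader}

reads = load_sheet(io.StringIO(SHEET))

def get_fastq_for(sample):
    """Look up the FASTQ files that belong to one sample id."""
    return reads[sample]

print(get_fastq_for("set1a"))
```

In a Snakefile the same dictionary could back an input function that receives wildcards and returns reads[wildcards.sample].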
Edit:
Rewriting the code to look like this is much more readable (in my opinion).
rule hisat2:
    input:
        ["{sample}_R1.fastq.gz",
         "{sample}_R2.fastq.gz"]
    ...

Accessing environment variables in a YAML file for Ruby project (using ${ENVVAR} syntax)

I am building an open source project using Ruby for testing HTTP services: https://github.com/Comcast/http-blackbox-test-tool
I want to be able to reference environment variables in my test-plan.yaml file. I could use ERB, however I don't want to support embedding any random Ruby code and ERB syntax is odd for non-rubyists, I just want to access environment variables using the commonly used Unix style ${ENV_VAR} syntax.
e.g.
order-lunch-app-health:
  request:
    url: ${ORDER_APP_URL}
    headers:
      content-type: 'application/text'
    method: get
  expectedResponse:
    statusCode: 200
    maxRetryCount: 5
All examples I have found for Ruby use ERB. Does anyone have a suggestion on the best way to deal with this? I am open to using another tool to preprocess the YAML and then send that to the Ruby application.
I believe something like this should work under most circumstances:
require 'yaml'
def load_yaml(file)
  content = File.read file
  # replace every ${VAR} with the value of the environment variable VAR
  content.gsub!(/\${([^}]+)}/) do
    ENV[$1]
  end
  YAML.load content
end
p load_yaml 'sample.yml'
As opposed to my original answer, this is both simpler and handles undefined ENV variables well.
Try with this YAML:
# sample.yml
path: ${PATH}
home: ${HOME}
error: ${NO_SUCH_VAR}
Original Answer (left here for reference)
There are several ways to do it. If you want to allow your users to use the ${VAR} syntax, then perhaps one way would be to first convert these variables to Ruby string substitution format %{VAR} and then evaluate all environment variables together.
Here is a rough proof of concept:
require 'yaml'
# Transform environments to a hash of { symbol: value }
env_hash = ENV.to_h.transform_keys(&:to_sym)
# Load the file and convert ${ANYTHING} to %{ANYTHING}
content = File.read 'sample.yml'
content.gsub! /\${([^}]+)}/, "%{\\1}"
# Use Ruby string substitution to replace %{VARS}
content %= env_hash
# Done
yaml = YAML.load content
p yaml
Use it with this sample.yml for instance:
# sample.yml
path: ${PATH}
home: ${HOME}
There are many ways this can be improved upon of course.
Preprocessing is easy, and I recommend you use a YAML loader/dumper
based solution, as the replacement might require quotes around the
replacement scalar. (E.g. if you substitute the string true and it
were not quoted, the resulting YAML would be read as a boolean.)
Assume your "source" is in input.yaml, your environment variable
ORDER_APP_URL is set to https://some.site/and/url, and the following
script is in expand.py:
import sys
import os
from pathlib import Path
import ruamel.yaml
def substenv(d, env):
    # recursively replace ${VAR} in string values of mappings and sequences
    if isinstance(d, dict):
        for k, v in d.items():
            if isinstance(v, str) and '${' in v:
                d[k] = v.replace('${', '{').format(**env)
            else:
                substenv(v, env)
    elif isinstance(d, list):
        for idx, item in enumerate(d):
            # check the list item itself
            if isinstance(item, str) and '${' in item:
                d[idx] = item.replace('${', '{').format(**env)
            else:
                substenv(item, env)
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load(Path(sys.argv[1]))
substenv(data, os.environ)
yaml.dump(data, Path(sys.argv[2]))
You can then do:
python expand.py input.yaml output.yaml
which writes output.yaml:
order-lunch-app-health:
  request:
    url: https://some.site/and/url
    headers:
      content-type: 'application/text'
    method: get
  expectedResponse:
    statusCode: 200
    maxRetryCount: 5
Please note that the spurious quotes around 'application/text' are preserved, as would be any comments
in the original file.
Quotes around the substituted URL are not necessary, but they would have been added if they were.
The substenv routine recursively traverses the loaded data, and substitutes even if the substitution is in mid-scalar, and if there is more than one substitution in one scalar. You can "tighten" the test:
if isinstance(v, str) and '${' in v:
if that would match too many strings loaded from YAML.
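One way such a tightened test might look is a regular expression that only accepts ${NAME} tokens, where NAME is a conventional environment variable name. This is a sketch under that assumption rather than part of the script above; the name expand_env is hypothetical, and unknown variables are deliberately left untouched instead of raising an error.

```python
import re

# ${NAME} where NAME starts with a letter or underscore
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def expand_env(text, env):
    """Replace ${NAME} tokens found in env; leave everything else as it is."""
    return _ENV_REF.sub(lambda m: env.get(m.group(1), m.group(0)), text)

env = {"ORDER_APP_URL": "https://some.site/and/url"}
print(expand_env("${ORDER_APP_URL}/health", env))
print(expand_env("{literal braces} and ${UNKNOWN}", env))
```

Unlike the replace-then-format approach, stray braces elsewhere in a scalar are not interpreted, so a value like {literal braces} passes through unchanged.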

Prettify YAML with comments

1. Summary
I can't find how to automatically prettify my YAML files.
2. Data
Example:
I have a SashaPrettifyYAML.yaml file:
sasha_commands:
  # Sasha comment
  sasha_command_help: {call: sublime.command_help, caption: 'Sasha Command: Command Help'}
3. Expected behavior
I want to delete {braces}:
sasha_commands:
  # Sasha comment
  sasha_command_help:
    call: sublime.command_help
    caption: 'Sasha Command: Command Help'
4. Not helped
Pretty YAML (based on PyYAML) and online formatters such as YAML Formatter and OnlineYAMLTools delete comments;
I can't find the required option in ruamel.yaml.cmd;
align-yaml aligns, but does not prettify, YAML files.
There is no option to do this in ruamel.yaml.cmd, but it is fairly straightforward to do this with a small python program and using ruamel.yaml, by loading and dumping in round-trip mode (the default).
The only thing you need to do is make sure the flow-style on the data-structure that is the value for the key sasha_command_help is set to block-style (which is how I interpret your definition of "prettifying YAML"):
import sys
import ruamel.yaml
yaml_str = """\
sasha_commands:
  # Sasha comment
  sasha_command_help: {call: sublime.command_help, caption: 'Sasha Command: Command Help'}
"""
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load(yaml_str)
data['sasha_commands']['sasha_command_help'].fa.set_block_style()
yaml.dump(data, sys.stdout)
This gives exactly the output you expect.
A recursive data structure walker can be found in scalarstring.py in the ruamel.yaml source, and adapted to make a generic "make-everything-block-style" routine:
import sys
import ruamel.yaml
def block_style(base):
    """
    This routine walks over a simple, i.e. consisting of dicts, lists and
    primitives, tree loaded from YAML. It recurses into dict values and list
    items, and sets block-style on these.
    """
    if isinstance(base, dict):
        for k in base:
            try:
                base.fa.set_block_style()
            except AttributeError:
                pass
            block_style(base[k])
    elif isinstance(base, list):
        for elem in base:
            try:
                base.fa.set_block_style()
            except AttributeError:
                pass
            block_style(elem)
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
file_in = sys.argv[1]
file_out = sys.argv[2]
with open(file_in) as fp:
    data = yaml.load(fp)
block_style(data)
with open(file_out, 'w') as fp:
    yaml.dump(data, fp)
If you store the above in prettifyyaml.py you can call it with:
python prettifyyaml.py SashaPrettifyYAML.yaml Prettified.yaml
Since you are already using single quotes around the scalar that has embedded spaces, you won't see a change if you leave out yaml.preserve_quotes = True. But if you had used a double quoted scalar then that line makes sure the double quotes are preserved.
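The walking pattern itself is independent of ruamel.yaml and can be tried on plain dicts and lists. The sketch below separates the traversal from the action applied to each container; walk_containers and set_block_if_possible are hypothetical names, and on plain Python containers (which have no fa attribute) the second function simply does nothing.

```python
def walk_containers(node, visit):
    """Call visit(node) for every dict and list in a nested structure."""
    if isinstance(node, dict):
        visit(node)
        for value in node.values():
            walk_containers(value, visit)
    elif isinstance(node, list):
        visit(node)
        for item in node:
            walk_containers(item, visit)

def set_block_if_possible(node):
    """Switch a round-trip loaded node to block style; skip plain containers."""
    fa = getattr(node, "fa", None)
    if fa is not None:
        fa.set_block_style()

# Plain data standing in for a loaded document
sample = {"a": {"b": [1, {"c": 2}]}, "d": 3}
visited = []
walk_containers(sample, visited.append)
walk_containers(sample, set_block_if_possible)  # no-op on plain dicts and lists
print(len(visited))
```

Passing set_block_if_possible to the walker on data loaded in round-trip mode should have the same effect as the block_style routine above, with the traversal reusable for other per-node actions.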
I had the same problem. I wrote my own YAML beautifier https://github.com/wangkuiyi/yamlfmt. I hope it helps.
I tried top results from Google, but none of them address the requirements of https://sqlflow.org/sqlflow, which I am leading:
https://pypi.org/project/yamlfmt cannot handle a file of multiple YAML documents separated by ---
https://github.com/devopyio/yamlfmt cannot handle multiple files.
https://github.com/miekg/yamlfmt/blob/master/fmt.go cannot replace (inline edit) the input files.
You can use the yq tool: it's easy to install and use, and it's well maintained.
Supposing you have an example.yml file to format, it can be processed in the following ways:
from file: yq r --unwrapScalar -p pv -P example.yml '*'
from stdin: cat example.yml | yq r --unwrapScalar -p pv -P - '*'
