merge multiple YAML files under a new field

I have several YAML files representing the pieces of a whole. I want to merge them under a new field ("guests") that declares the whole.
file1.yml
name: johnny
age: 23
file2.yml
name: sally
age: 21
output.yml
guests:
  - name: johnny
    age: 23
  - name: sally
    age: 21
Tools like yq make merging/overwriting easy, but I can't find any that help me nest values under new fields.

The tool you are looking for comes under several different names: such tools are called programming languages or scripting languages. I recommend you use Python with ruamel.yaml installed (disclaimer: I am the author of that package).
Once you have that you can do:
python -c "import sys, ruamel.yaml; yaml=ruamel.yaml.YAML(); yaml.indent(sequence=4, offset=2); yaml.dump(dict(guest=[yaml.load(open(f)) for f in sys.argv[1:]]), sys.stdout)" file*.yml > output.yml
To get the desired output.
A few notes:
YAML files should have the .yaml extension unless your filesystem doesn't support that.
By default, sequence elements are indented two spaces and the dash has no offset within that (i.e. it would align with the g of guests); hence the yaml.indent() call.
Any comments on the key-value pairs in your input files would be preserved, but not automatically pushed out to the right from their original starting column unless necessary because a mapping value gets in the way. Adjusting that is possible, but I would not recommend trying it in a one-liner.
If you need to preserve quotes, add yaml.preserve_quotes = True; to the one-liner.
If any of your YAML files contains multiple YAML documents, the above will fail. You would need to think about how to combine the documents, and use a try/except clause to fall back to yaml.load_all() for files that contain multiple documents (it would be a good idea to abandon the one-liner in favour of a multi-line Python program at that point; a sketch follows below).
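A minimal multi-line sketch of that fallback, assuming (as in the one-liner) that every document simply becomes one more guest; how you actually want to combine multiple documents from one file is application-specific:
import sys
import ruamel.yaml

yaml = ruamel.yaml.YAML()
yaml.indent(sequence=4, offset=2)

guests = []
for path in sys.argv[1:]:
    try:
        with open(path) as f:
            guests.append(yaml.load(f))
    except ruamel.yaml.composer.ComposerError:
        # more than one document in this file: re-open and load them all
        with open(path) as f:
            guests.extend(yaml.load_all(f))

yaml.dump({'guests': guests}, sys.stdout)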
You can also do the above using the yaml command-line utility (installable with pip install "ruamel.yaml.cmd>=0.5.0"):
yaml from-dirs --sequence ./*.yml | yaml map --indent 2,4,2 guests - > output.yml
but this is a two-step process (first combining the multiple YAML files as a root-level sequence, then pushing that sequence down to be the value of a mapping), and thus roughly twice as slow as the one-liner.

Related

Helm split global section

I have a helm values.yaml file like below
global:
  foo: bar
  foo1: bar1
random-chart:
  fooo: baar
My use case is to append values in both global and random-chart during run time.
After appending values, the chart looks like this.
global:
  foo: bar
  foo1: bar1
random-chart:
  fooo: baar
global:
  secret: password
random-chart:
  secret1: password1
Since there are two different global and random-chart keys, will it work as intended, and is it good practice to do that?
This probably won't work as intended.
The YAML 1.2.2 spec notes (emphasis from original):
The content of a mapping node is an unordered set of key/value node pairs, with the restriction that each of the keys is unique.
And in discussing loading errors continues:
... mapping keys may not be unique ....
So the YAML file you show has a mapping with two keys both named global and two keys both named random-chart, and that's not valid. Depending on the specific YAML library that's being used, this might be interpreted as a loading error, or the library might just pick the last value of global.
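As a quick illustration of that loader-dependent behaviour, a minimal PyYAML sketch; current PyYAML silently keeps the last value rather than raising an error:
import yaml

doc = """
global:
  foo: bar
global:
  secret: password
"""
# the first 'global' mapping is silently discarded
print(yaml.safe_load(doc))   # {'global': {'secret': 'password'}}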
In general, it's hard to work with YAML files using line-oriented shell tools, since there are so many syntactic variations. A dedicated library in a higher-level language will usually work better. For example, using the Python PyYAML library:
import yaml

with open('values.in.yaml', 'r') as f:
    values = yaml.safe_load(f)

values['global']['secret'] = 'password'
values['random-chart']['secret1'] = 'password1'

with open('values.out.yaml', 'w') as f:
    yaml.dump(values, f)
Two other possibilities to consider: you can pass multiple helm install -f options, so you can write out a file with just the values you're adding and have it merged with the other settings (you do not need to repeat the values from the chart's values.yaml file). Depending on your environment, you may also find it easier to dynamically write out JSON files, particularly if you don't need to re-read the base chart; setups like Jenkins or JavaScript applications have built-in JSON support, and valid JSON turns out to be valid YAML.
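A minimal sketch of that JSON approach (the file name, release name and chart path here are assumptions):
import json

overrides = {
    'global': {'secret': 'password'},
    'random-chart': {'secret1': 'password1'},
}
with open('overrides.json', 'w') as f:
    json.dump(overrides, f)

# then merge the overrides at install time, e.g.:
#   helm install my-release ./chart -f overrides.json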

How to concatenate many files using their basenames?

I study genetic data from 288 fish samples (Fish_one, Fish_two ...)
I have four files per fish, each with a different suffix.
eg. for sample_name Fish_one:
file 1 = "Fish_one.1.fq.gz"
file 2 = "Fish_one.2.fq.gz"
file 3 = "Fish_one.rem.1.fq.gz"
file 4 = "Fish_one.rem.2.fq.gz"
I would like to apply the following concatenate instructions to all my samples, using maybe a text file containing a list of all the sample_name, that would be provided to a loop?
cp sample_name.1.fq.gz sample_name.fq.gz
cat sample_name.2.fq.gz >> sample_name.fq.gz
cat sample_name.rem.1.fq.gz >> sample_name.fq.gz
cat sample_name.rem.2.fq.gz >> sample_name.fq.gz
In the end, I would have only one file per sample, ideally in a different folder.
I would be very grateful to receive a bit of help on this one, even though I'm sure the answer is quite simple for a non-novice!
Many thanks,
Noé
I would like to apply the following concatenate instructions to all my samples, using maybe a text file containing a list of all the sample_name, that would be provided to a loop?
In the first place, the name of the cat command is mnemonic for "concatenate". It accepts multiple command-line arguments naming sources to concatenate together to the standard output, which is exactly what you want to do. It is poor form to use a cp and three cats where a single cat would do.
In the second place, although you certainly could use a file of name stems to drive the operation you describe, it's likely that you don't need to go to the trouble to create or maintain such a file. Globbing will probably do the job satisfactorily. As long as there aren't any name stems that need to be excluded, then, I'd probably go with something like this:
for f in *.rem.1.fq.gz; do
  stem=${f%.rem.1.fq.gz}
  cat "$stem".{1,2,rem.1,rem.2}.fq.gz > "${other_dir}/${stem}.fq.gz"
done
That recognizes the groups present in the current working directory by the members whose names end with .rem.1.fq.gz. It extracts the common name stem from that member's name, then concatenates the four members to the correspondingly-named output file in the directory identified by ${other_dir}. It relies on brace expansion to form the arguments to cat, so as to minimize code and (IMO) improve clarity.
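If you do prefer to drive the operation from a list of sample names, as the question suggests, here is a hedged Python sketch of that variant (assuming one sample name per line in samples.txt and an output directory named merged; gzip streams can be concatenated byte-for-byte):
import shutil
from pathlib import Path

out_dir = Path('merged')            # assumed output directory
out_dir.mkdir(exist_ok=True)

for stem in Path('samples.txt').read_text().split():
    # members of a gzip stream can simply be appended to one another
    with open(out_dir / f'{stem}.fq.gz', 'wb') as out:
        for suffix in ('1', '2', 'rem.1', 'rem.2'):
            with open(f'{stem}.{suffix}.fq.gz', 'rb') as part:
                shutil.copyfileobj(part, out)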

How to Reference an aliased map value in YAML

I have a feeling this isn't possible, but I have a snippet of YAML that looks like the following:
.map_values: &my_map
a: 'D'
b: 'E'
a: 'F'
section:
stage: *my_map['b']
I would like stage to have the value of E.
Is this possible within YAML? I've tried just about every incarnation of substitution I can think of.
Since there is a duplicate key in your mapping, which is not allowed in YAML 1.2 (and should at least throw a warning in YAML 1.1), this is not going to work; but even if you correct that, you can't do this with just anchors and aliases.
The only substitution-like replacement available in YAML is the "Merge Key Language-Independent Type". It is indirectly referenced in the YAML spec, not included in it, but available in most parsers.
The only thing it allows you to do is "update" a mapping with the key-value pairs of one or more other mappings, if the key doesn't already exist in that mapping. You use the special key << for that, which takes an alias or a list of aliases.
There is no facility, specified in the YAML specification, to dereference particular keys.
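To make the merge key concrete, a minimal sketch (the keys here are made up; PyYAML is used because it supports the merge key out of the box):
import yaml

doc = """
defaults: &defaults
  a: 1
  b: 2

override:
  <<: *defaults
  b: 3
"""
print(yaml.safe_load(doc)['override'])   # {'a': 1, 'b': 3}
Note that this only merges whole mappings; it still cannot pull out a single value the way *my_map['b'] attempts to.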
There are some systems that use templates to generate YAML, but there are two main problems with applying them here:
the template languages themselves often clash with the indicators in YAML syntax, making the template invalid YAML
even if the template could be loaded as valid YAML, and the values needed to update other parts of the template extracted, you would need to parse the input twice (once to get the values to update the template, then again to parse the updated template). Given the potential complexity of YAML and the relatively slow speed of its parsers, this can be prohibitive.
What you can do is create some tag (e.g. !lookup) and have its constructor interpret that node. Since the node has to be valid YAML, you again have to decide whether to use a sequence or a mapping. In both cases you'll have to include some special syntax for the values, and in the case of mappings for the key as well (like the << used in merges).
In the examples I left out the spurious single quotes; depending on your real values you might of course need them.
Example using a sequence:
.map_values: &my_map
  a: D
  b: E
  c: F

section: !Lookup
  - *my_map
  - stage: <b>
Example using a mapping:
.map_values: &my_map
  a: D
  b: E
  c: F

section: !Lookup
  <<: *my_map
  stage: <b>
Both can be made to construct the data on the fly (i.e. no post-loading processing of your data structure is necessary), e.g. using Python with the sequence "style" example above as input.yaml:
import sys
import ruamel.yaml
from pathlib import Path

input = Path('input.yaml')

yaml = ruamel.yaml.YAML(typ='safe')
yaml.default_flow_style = False

@yaml.register_class
class Lookup:
    @classmethod
    def from_yaml(cls, constructor, node):
        """
        this expects a two-entry sequence, in which the first entry is a mapping X,
        typically provided via an alias;
        the second entry should be a mapping, for which any value of the form <key>
        is looked up in X;
        non-existing keys will throw an error during loading.
        """
        X, res = constructor.construct_sequence(node, deep=True)
        yield res
        for key, value in res.items():
            try:
                if value.startswith('<') and value.endswith('>'):
                    res[key] = X[value[1:-1]]
            except AttributeError:
                pass
        return res

data = yaml.load(input)
yaml.dump(data, sys.stdout)
which gives:
.map_values:
  a: D
  b: E
  c: F
section:
  stage: E
There are a few things to note:
using <...> is arbitrary; you don't need both a beginning and an end marker. I do recommend using some character(s) that have no special meaning in YAML, so you don't need to quote your values. You could e.g. use some well recognisable Unicode code point, but those tend to be a pain to type in an editor.
when from_yaml is called, the anchored mapping is not yet fully constructed, so X is an empty dict that gets filled later on. The construct with yield implements a two-step process: we first hand res back "as-is" to the constructor, then update it later on. The constructor stage of the loader knows how to handle this automatically when it gets a generator instead of a "normal" value.
the try .. except is there to handle mapping values that are not strings (i.e. numbers, dates, booleans).
you can do substitutions in keys as well; just make sure you delete the old key.
Since tags are standard YAML, the above should be doable one way or another in any YAML parser, independent of the language.

How to fetch all records using NCBI Batch Entrez

I have over 200,000 accessions in a flat file and need to retrieve the relevant entries from NCBI.
I use Batch Entrez (http://www.ncbi.nlm.nih.gov/sites/batchentrez) to do the job, but have encountered several problems:
The initial file was split into multiple sub-files, each containing 4000 lines. But it seems Batch Entrez has some size limitation on the returned file. For example: if the first 1000 accessions each have tens of thousands of lines and reach the size limitation, then the remaining 3000 accessions will be rejected and won't be searched.
One possible solution in my head is to split the file into more sub-files and search them individually. However, this requires too much manual effort.
So I am just wondering if there is any other solution, or any code that could be used.
Thanks in advance
Your problem sounds like a good fit for a Bio* ("Bio-star") toolkit. This is a solution using BioSmalltalk:
| giList gbReader |
giList := (BioObject openFullFileNamed: 'd:\Batch_entrez_1.txt') contents lines.
gbReader := BioNCBIGenBankReader new.
gbReader
    genBankRecordsFrom: 'nuccore'
    format: #setModeXML
    uids: giList.
(BioGBSeqCollection newFromXMLCollection: gbReader searchResults)
    collect: [: e | BioParser
        tokenizeNcbiXmlBlast: e contents
        nodes: #('GBAuthor' 'GBSeq_definition') ]
To execute/debug the script, just select it and a right-click will open the Smalltalk world-menu.
The API automatically splits and fetches your accession list (contained, in the script, in Batch_entrez_1.txt), maintaining the NCBI Entrez post limits to avoid penalties.
The result format is XML (an "easy" format to parse or to filter specific fields from), although it could be any of the retrieval modes supported by Entrez; for example, setting #setModeText will answer an ASN.1 representation. Replace 'nuccore' with the database you want to query. Finally, choose the interesting fields; in the script I have chosen 'GBAuthor' and 'GBSeq_definition', but you are free to choose any of the available nodes.
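For comparison, the same batching idea sketched in Python with Biopython's Entrez module; the chunk size, sleep interval and file names are assumptions, not NCBI-mandated values:
import time
from Bio import Entrez

Entrez.email = "you@example.org"   # NCBI requires a contact address

with open("Batch_entrez_1.txt") as f:
    accessions = f.read().split()

for i in range(0, len(accessions), 200):       # 200 ids per request (assumed)
    chunk = accessions[i:i + 200]
    handle = Entrez.efetch(db="nuccore", id=",".join(chunk), retmode="xml")
    with open(f"batch_{i // 200}.xml", "w") as out:
        out.write(handle.read())
    handle.close()
    time.sleep(0.4)                            # stay under ~3 requests per second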

Import a YAML (fixtures) file from another

For better organization (e.g. seed and test data), is it possible to split the YAML file and import the first one from the second? Of course, the variables in file1 should be available for use in the second. I use a SnakeYAML-based parser in Java, if that matters.
Thanks.
Update 1: (example)
Seed file: seed.yaml
Priority(L1E1):
  level: 1
  priorityCode: E1
  description: Escalation
test data file: test-data.yaml
Request(RER1):
  priority: L1E1
  title: Something
So I need to split the files, as they are becoming huge. Also, the variable/data (L1E1 in this case) defined in one file needs to be accessible in the second file.
YAML does not define an "include" directive.
What do you mean by "the variables in file1 should be available to use in the second"? Do you expect anchors and aliases to work across the files?
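That said, many loaders let you add an include mechanism yourself as a custom tag. A minimal sketch in Python with PyYAML (the !include tag name is our choice; SnakeYAML supports custom constructors in a similar spirit, but note that anchors and aliases still will not work across files):
import yaml

class IncludeLoader(yaml.SafeLoader):
    pass

def _include(loader, node):
    # load the referenced file with the same loader, so nested includes work
    with open(loader.construct_scalar(node)) as f:
        return yaml.load(f, IncludeLoader)

IncludeLoader.add_constructor('!include', _include)

# usage: test-data.yaml could start with ->  seed: !include seed.yaml
with open('test-data.yaml') as f:
    data = yaml.load(f, IncludeLoader)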
