Difference between Array and Dictionary in YAML

I understand the difference between an array and a dictionary in, say, Python, especially when it comes to accessing content in both.
But in YAML I find it difficult to differentiate between the two, especially when it comes to writing structures. How do they differ in their definitions?

# YAML sequences (aka Python lists)
# option 1: block style
departments:
- marketing
- sales
- security

# option 2: flow style
departments: [marketing, sales, security]

# YAML dictionary
marketing:
  team-size: 20
  location: nyc

# expanding the sequence to include dictionaries
departments:
- marketing:
    team-size: 20
    location: nyc
- sales:
    team-size: 30
    location: sf
- security:
    team-size: 10
    location: mia
Given your familiarity with Python, I'll focus on syntax:
Sequences come in two styles: block and flow. In block style, each element in the list is preceded by a dash and a space (`- `).
Dictionaries (mappings) come in the form of key: value pairs. They are written as a name followed by a colon and a space (`name: `).
In a simple key: value pairing, the value is a scalar (i.e. a single value such as a string, number, or boolean, not a collection).
These values can themselves be mappings to further key: value pairs.
You can have nested sequences (lists) and nested mappings (dictionaries).
To clarify with an example: option 1 above shows a block-style sequence, denoted by indentation plus `- `. We can turn each element, or value, in that sequence into a mapping by adding further indented key: value lines without the `- ` (see the `# expanding the sequence` example).
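To see how these structures map onto Python types, here is a small sketch (assuming PyYAML is installed and imported as `yaml`):

```python
import yaml

# Block-style sequence: each element preceded by "- "; loads as a Python list
block = yaml.safe_load("""
departments:
- marketing
- sales
- security
""")

# Flow-style sequence: the same data on one line
flow = yaml.safe_load("departments: [marketing, sales, security]")

# A mapping of key: value pairs loads as a Python dict
mapping = yaml.safe_load("""
marketing:
  team-size: 20
  location: nyc
""")

print(block == flow)                      # True: both styles give the same list
print(mapping['marketing']['team-size'])  # 20
```

Block and flow styles are purely presentational; the loaded data is identical.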

@kebanus ...here's an example I am looking at.
...everything looks like a normal dictionary to me, but the author calls them an array, apart from the first basic key-value pairs:
doe: "a deer, a female deer"
ray: "a drop of golden sun"
pi: 3.14159
xmas: true
french-hens: 3
calling-birds:
- huey
- dewey
- louie
- fred
xmas-fifth-day:
  calling-birds: four
  french-hens: 3
  golden-rings: 5
  partridges:
    count: 1
    location: "a pear tree"
  turtle-doves: two
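Loading a trimmed-down version of this document shows which parts are dictionaries and which are arrays (a PyYAML sketch, assuming the package is installed):

```python
import yaml

doc = yaml.safe_load("""
doe: "a deer, a female deer"
french-hens: 3
calling-birds:
- huey
- dewey
- louie
- fred
xmas-fifth-day:
  calling-birds: four
  golden-rings: 5
""")

print(type(doc))                    # <class 'dict'>  -- the whole document
print(type(doc['calling-birds']))   # <class 'list'>  -- the only array here
print(type(doc['xmas-fifth-day']))  # <class 'dict'>  -- a nested dictionary
```

Only `calling-birds` at the top level is an array; everything else is a (possibly nested) dictionary.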

Related

Repeatable list items in YAML [duplicate]

I would like to merge arrays in YAML, and load them via ruby -
some_stuff: &some_stuff
- a
- b
- c
combined_stuff:
  <<: *some_stuff
  - d
  - e
  - f
I'd like to have the combined array as [a,b,c,d,e,f]
I receive the error: did not find expected key while parsing a block mapping
How do I merge arrays in YAML?
If the aim is to run a sequence of shell commands, you may be able to achieve this as follows:
# note: no dash before commands
some_stuff: &some_stuff |-
  a
  b
  c
combined_stuff:
- *some_stuff
- d
- e
- f
This is equivalent to:
some_stuff: "a\nb\nc"
combined_stuff:
- "a\nb\nc"
- d
- e
- f
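A quick check confirms the two forms parse identically (a sketch assuming PyYAML; `|-` is a literal block scalar with the final newline stripped):

```python
import yaml

# Literal block scalar form
literal = yaml.safe_load("""
some_stuff: &some_stuff |-
  a
  b
  c
combined_stuff:
- *some_stuff
- d
""")

# Equivalent explicit form with escaped newlines
explicit = yaml.safe_load("""
some_stuff: "a\\nb\\nc"
combined_stuff:
- "a\\nb\\nc"
- d
""")

print(literal == explicit)   # True
print(literal['some_stuff'])
```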
I have been using this in my .gitlab-ci.yml (to answer @rink.attendant.6's comment on the question).
Working example that we use to support requirements.txt having private repos from gitlab:
.pip_git: &pip_git
- git config --global url."https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com".insteadOf "ssh://git@gitlab.com"
- mkdir -p ~/.ssh
- chmod 700 ~/.ssh
- echo "$SSH_KNOWN_HOSTS" > ~/.ssh/known_hosts
- chmod 644 ~/.ssh/known_hosts
test:
  image: python:3.7.3
  stage: test
  script:
  - *pip_git
  - pip install -q -r requirements_test.txt
  - python -m unittest discover tests
use the same `*pip_git` in e.g. the build image...
where requirements_test.txt contains e.g.
-e git+ssh://git@gitlab.com/example/example.git@v0.2.2#egg=example
This is not going to work:
merge is only supported by the YAML specification for mappings, not for sequences
you are completely mixing things by having a merge key << followed by the key/value separator : and a value that is a reference, and then continuing with a list at the same indentation level
This is not correct YAML:
combine_stuff:
  x: 1
  - a
  - b
So your example syntax would not even make sense as a YAML extension proposal.
If you want to do something like merging multiple arrays you might want to consider a syntax like:
combined_stuff:
- <<: *s1, *s2
- <<: *s3
- d
- e
- f
where s1, s2, s3 are anchors on sequences (not shown) that you
want to merge into a new sequence and then have the d, e and f
appended to that. But YAML is resolving these kind of structures depth
first, so there is no real context available during the processing
of the merge key. There is no array/list available to you where you
could attach the processed value (the anchored sequence) to.
You can take the approach proposed by @dreftymac, but this has the huge disadvantage that you
somehow need to know which nested sequences to flatten (i.e. by knowing the "path" from the root
of the loaded data structure to the parent sequence), or that you recursively walk the loaded
data structure searching for nested arrays/lists and indiscriminately flatten all of them.
A better solution IMO would be to use tags to load data structures
that do the flattening for you. This allows for clearly denoting what
needs to be flattened and what not and gives you full control over
whether this flattening is done during loading, or done during
access. Which one to choose is a matter of ease of implementation and
efficiency in time and storage space. This is the same trade-off that needs to be made
for implementing the merge key feature and
there is no single solution that is always the best.
E.g. my ruamel.yaml library uses the brute force merge-dicts during
loading when using its safe-loader, which results in merged
dictionaries that are normal Python dicts. This merging has to be done
up-front, and duplicates data (space inefficient) but is fast in value
lookup. When using the round-trip-loader, you want to be able to dump
the merges unmerged, so they need to be kept separate. The dict like
datastructure loaded as a result of round-trip-loading, is space
efficient but slower in access, as it needs to try and lookup a key
not found in the dict itself in the merges (and this is not cached, so
it needs to be done every time). Of course such considerations are
not very important for relatively small configuration files.
The following implements a merge-like scheme for lists in Python, using objects tagged !flatten which on the fly recurse into items that are lists tagged !toflatten. Using these two tags
you can have this YAML file:
l1: &x1 !toflatten
- 1
- 2
l2: &x2
- 3
- 4
m1: !flatten
- *x1
- *x2
- [5, 6]
- !toflatten [7, 8]
(the use of flow vs block style sequences is completely arbitrary and has no influence on the
loaded result).
When iterating over the items that are the value for key m1 this
"recurses" into the sequences tagged with toflatten, but displays
other lists (aliased or not) as a single item.
One possible way with Python code to achieve that is:
import sys
from pathlib import Path
import ruamel.yaml

yaml = ruamel.yaml.YAML()


@yaml.register_class
class Flatten(list):
    yaml_tag = u'!flatten'

    def __init__(self, *args):
        self.items = args

    @classmethod
    def from_yaml(cls, constructor, node):
        x = cls(*constructor.construct_sequence(node, deep=True))
        return x

    def __iter__(self):
        for item in self.items:
            if isinstance(item, ToFlatten):
                for nested_item in item:
                    yield nested_item
            else:
                yield item


@yaml.register_class
class ToFlatten(list):
    yaml_tag = u'!toflatten'

    @classmethod
    def from_yaml(cls, constructor, node):
        x = cls(constructor.construct_sequence(node, deep=True))
        return x


data = yaml.load(Path('input.yaml'))
for item in data['m1']:
    print(item)
which outputs:
1
2
[3, 4]
[5, 6]
7
8
As you can see, in the sequence that needs flattening, you can either use an alias to a tagged sequence or use a tagged sequence directly. YAML doesn't allow you to do:
- !flatten *x2
i.e. tag an anchored sequence, as this would essentially make it into a different data structure.
Using explicit tags is IMO better than having some magic going on as
with YAML merge keys <<. If nothing else you now have to go through
hoops if you happen to have a YAML file with a mapping that has a key
<< that you don't want to act like a merge key, e.g. when you make a
mapping of C operators to their descriptions in English (or some other natural language).
Update: 2019-07-01 14:06:12
Note: another answer to this question was substantially edited with an update on alternative approaches.
That updated answer mentions an alternative to the workaround in this answer. It has been added to the See also section below.
Context
This post assumes the following context:
python 2.7
python YAML parser
Problem
lfender6445 wishes to merge two or more lists within a YAML file, and have those
merged lists appear as one singular list when parsed.
Solution (Workaround)
This may be obtained simply by assigning YAML anchors to mappings, where the
desired lists appear as child elements of the mappings. There are caveats to this, however (see "Pitfalls" below).
In the example below we have three mappings (list_one, list_two, list_three) and three anchors
and aliases that refer to these mappings where appropriate.
When the YAML file is loaded in the program we get the list we want, but
it may require a little modification after load (see pitfalls below).
Example
Original YAML file
list_one: &id001
- a
- b
- c

list_two: &id002
- e
- f
- g

list_three: &id003
- h
- i
- j

list_combined:
- *id001
- *id002
- *id003
Result after YAML.safe_load
## list_combined
[
  [
    "a",
    "b",
    "c"
  ],
  [
    "e",
    "f",
    "g"
  ],
  [
    "h",
    "i",
    "j"
  ]
]
Pitfalls
this approach produces a nested list of lists, which may not be the exact desired output, but this can be post-processed using the flatten method
the usual caveats to YAML anchors and aliases apply for uniqueness and declaration order
Conclusion
This approach allows creation of merged lists by use of the alias and anchor feature of YAML.
Although the output result is a nested list of lists, this can be easily transformed using the flatten method.
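The post-processing step can be sketched as a one-level flatten in Python (assuming PyYAML is available):

```python
import yaml

data = yaml.safe_load("""
list_one: &id001
- a
- b
- c
list_two: &id002
- e
- f
- g
list_combined:
- *id001
- *id002
""")

# list_combined loads as a list of lists; flatten it one level
flattened = [item for sublist in data['list_combined'] for item in sublist]
print(flattened)  # ['a', 'b', 'c', 'e', 'f', 'g']
```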
See also
Updated alternative approach by @Anthon
See alternative approach
Examples of the flatten method
Javascript flatten ;; Merge/flatten an array of arrays
Ruby flatten ;; http://ruby-doc.org/core-2.2.2/Array.html#method-i-flatten
Python flatten ;; https://softwareengineering.stackexchange.com/a/254676/23884
If you only need to merge one item into a list you can do
fruit:
- &banana
  name: banana
  colour: yellow
food:
- *banana
- name: carrot
  colour: orange
which yields
fruit:
- name: banana
  colour: yellow
food:
- name: banana
  colour: yellow
- name: carrot
  colour: orange
Another way to enable merging arrays in Python is by defining a !flatten tag.
(This uses PyYAML, unlike Anthon's answer above, which may be necessary in cases where you don't have control over which package is used in the back end, e.g. anyconfig.)
from typing import Iterator, List

import yaml


def flatten_sequence(sequence: yaml.Node) -> Iterator[str]:
    """Flatten a nested sequence to a list of strings

    A nested structure is always a SequenceNode
    """
    if isinstance(sequence, yaml.ScalarNode):
        yield sequence.value
        return
    if not isinstance(sequence, yaml.SequenceNode):
        raise TypeError(f"'!flatten' can only flatten sequence nodes, not {sequence}")
    for el in sequence.value:
        if isinstance(el, yaml.SequenceNode):
            yield from flatten_sequence(el)
        elif isinstance(el, yaml.ScalarNode):
            yield el.value
        else:
            raise TypeError(f"'!flatten' can only take scalar nodes, not {el}")


def construct_flat_list(loader: yaml.Loader, node: yaml.Node) -> List[str]:
    """Make a flat list, should be used with '!flatten'

    Args:
        loader: Unused, but necessary to pass to `yaml.add_constructor`
        node: The passed node to flatten
    """
    return list(flatten_sequence(node))


yaml.add_constructor("!flatten", construct_flat_list)
This recursive flattening takes advantage of the PyYAML document structure, which parses all arrays as SequenceNodes, and all values as ScalarNodes.
The behavior can be tested (and modified) in the following test function.
import pytest


def test_flatten_yaml():
    # single nest
    param_string = """
bread: &bread
- toast
- loafs
chicken: &chicken
- *bread
midnight_meal: !flatten
- *chicken
- *bread
"""
    params = yaml.load(param_string, Loader=yaml.FullLoader)
    assert sorted(params["midnight_meal"]) == sorted(
        ["toast", "loafs", "toast", "loafs"]
    )
You can merge mappings and then convert their keys into a list, under these conditions:
if you are using Jinja2 templating, and
if item order is not important
some_stuff: &some_stuff
  a:
  b:
  c:
combined_stuff:
  <<: *some_stuff
  d:
  e:
  f:
{{ combined_stuff | list }}
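The same idea can be checked without Jinja2: the << merge works because both nodes are mappings, and iterating the merged mapping yields its keys (a PyYAML sketch; keys are printed sorted since merge-key ordering is loader-dependent):

```python
import yaml

data = yaml.safe_load("""
some_stuff: &some_stuff
  a:
  b:
  c:
combined_stuff:
  <<: *some_stuff
  d:
  e:
  f:
""")

# Iterating a dict yields its keys, giving the combined "list"
print(sorted(data['combined_stuff']))  # ['a', 'b', 'c', 'd', 'e', 'f']
```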

Consecutive - in a YAML file

While going through serverless basic setup, I came across a YAML file that has two consecutive - in the definition.
Following is the YAML
# serverless.yml
service: myService

provider:
  name: aws
  iam:
    role:
      statements:
      - Effect: 'Allow'
        Action:
        - 's3:ListBucket'
        # You can put CloudFormation syntax in here. No one will judge you.
        # Remember, this all gets translated to CloudFormation.
        Resource: { 'Fn::Join': ['', ['arn:aws:s3:::', { 'Ref': 'ServerlessDeploymentBucket' }]] }
      - Effect: 'Allow'
        Action:
        - 's3:PutObject'
        Resource:
          Fn::Join:
          - ''
          - - 'arn:aws:s3:::'
            - 'Ref': 'ServerlessDeploymentBucket'
            - '/*'

functions:
  functionOne:
    handler: handler.functionOne
    memorySize: 512
Here we can see that in - - 'arn:aws:s3:::' there are two consecutive -. Can somebody help me understand what that means?
I'm referring to https://www.serverless.com/framework/docs/providers/aws/guide/functions/.
Thanks in advance.
There is no special meaning of - -; both dashes act according to their general semantics:
A - starts a YAML sequence item. Sibling sequence items at the same indentation level form a sequence. So, this is a sequence consisting of two sequence items:
- a
- b
Now the content of a sequence item is any valid YAML node. A sequence is a valid YAML node. Thus, this is allowed:
- a
- - b
  - c
This YAML document is a sequence with two items, the first one being a scalar a, the second one being a nested sequence with two items, the scalars b and c. YAML calls this compact notation because the usual notation would be
- a
-
  - b
  - c
This longer notation allows giving the nested sequence an anchor (e.g. &a), a tag (e.g. !!seq), or both at its header line. Those are rather exotic features that are seldom used, and if you don't need them, you can use the compact notation instead, which leads to two consecutive - -.
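Loaded into Python, the compact and longer notations give the same nested list (a sketch assuming PyYAML):

```python
import yaml

# Compact notation: the nested sequence starts on the same line as its dash
compact = yaml.safe_load("""
- a
- - b
  - c
""")

# Longer notation: the nested sequence starts on the next line
longer = yaml.safe_load("""
- a
-
  - b
  - c
""")

print(compact)            # ['a', ['b', 'c']]
print(compact == longer)  # True
```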

Performance: Replacing Series values with keys from a Dictionary in Python

I have a data series that contains various names of the same organizations. I want to harmonize these names to a given standard using a mapping dictionary. I am currently using a nested for loop: I iterate through each series element, and if it is within the dictionary's values, I update the series value with the dictionary key.
# For example, corporation_series is:
0    'Corp1'
1    'Corp-1'
2    'Corp 1'
3    'Corp2'
4    'Corp--2'
dtype: object

# Dictionary is:
mapping_dict = {
    'Corporation_1': ['Corp1', 'Corp-1', 'Corp 1'],
    'Corporation_2': ['Corp2', 'Corp--2'],
}

# I use this logic to replace the values in the series
for index, value in corporation_series.items():
    for key, names in mapping_dict.items():
        if value in names:
            corporation_series = corporation_series.replace(value, key)
So, if the series has a value of 'Corp1', and it exists in the dictionary's values, the logic replaces it with the corresponding key of corporations. However, it is an extremely expensive method. Could someone recommend me a better way of doing this operation? Much appreciated.
I found a solution using pandas' .map method. In order to use .map, I had to invert my dictionary:
# Inverted dict:
inverted_dict = {
    'Corp1': 'Corporation_1',
    'Corp-1': 'Corporation_1',
    'Corp 1': 'Corporation_1',
    'Corp2': 'Corporation_2',
    'Corp--2': 'Corporation_2',
}

# use .map
corporation_series = corporation_series.map(inverted_dict)
Instead of 5 minutes of processing, it took around 5s. While this works, I'm sure there are better solutions out there. Any suggestions would be most welcome.
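One way to avoid maintaining the inverted dictionary by hand is to build it from the original mapping with a dict comprehension and pass it to Series.map (a pandas sketch; the names mirror the question):

```python
import pandas as pd

corporation_series = pd.Series(['Corp1', 'Corp-1', 'Corp 1', 'Corp2', 'Corp--2'])
mapping_dict = {
    'Corporation_1': ['Corp1', 'Corp-1', 'Corp 1'],
    'Corporation_2': ['Corp2', 'Corp--2'],
}

# Invert {canonical: [variants]} into {variant: canonical} in one pass
inverted = {variant: canonical
            for canonical, variants in mapping_dict.items()
            for variant in variants}

result = corporation_series.map(inverted)
print(result.tolist())
# ['Corporation_1', 'Corporation_1', 'Corporation_1', 'Corporation_2', 'Corporation_2']
```

Note that .map leaves NaN for series values missing from the dictionary; chain .fillna(corporation_series) if unmatched names should be kept as-is.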

Load YAML without expanding tags?

I am loading YAML files (specifically CloudFormation templates) which may contain custom tags (e.g. !Ref) that I want to treat as ordinary strings, i.e. YAML.safe_load('Foo: !Bar baz') would result in {"Foo"=>"!Bar baz"} or something similar. This is because I want to traverse and manipulate the template before dumping it back out. I would prefer not to have to add_tag everything under https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/intrinsic-function-reference.html. I am currently using Psych and Ruby 2.0, but neither is a strict requirement.
Update 1: I meant to say that answers based on Ruby versions newer than 2.0 are fine.
Update 2: I added the CloudFormation tag to this case because registering a bunch of !X -> Fn::X conversions may turn out to be the least bad solution and I have no need for a general Ruby question at this point.
OK, let's suppose you got {"Foo"=>"!Bar baz"} after parsing YAML.
You do something with it, and then you want to convert it back into YAML?
{"Foo" => "!Bar baz"}.to_yaml would result in Foo: "!Bar baz" -- which is not what you started with (it's a string now, tags aren't evaluated).
Going the way of parsing YAML is not trivial and perhaps something else should be done instead.
You should not need to create each and every type; what you would
need to do is make a generic tag-handling routine that looks at
the kind of node the tag is on (mapping, sequence, scalar), then
creates such a node as a Ruby type to which the tag can be attached.
I don't know how to do that with Psych and Ruby, but you indicated
neither is a strict requirement, and most of the hard work
for this kind of round-tripping has been done in ruamel.yaml for Python
(disclaimer: I am the author of that package).
If this is your input file input.yaml:
Foo: !Bar baz
N1:
- !mytaggedmaptype
  parm1: 3
  parm3: 4
- !mytaggedseqtype
  - 8
  - 9
N2: &someanchor1
  a: "some stuff"
  b: 0.2e+1
  f: |
    within a literal scalar newlines
    are preserved
N3: &someanchor2
  c: 0x3
  b: 4 # this value is not taken, as the first entry found is taken
  ['the', 'answer']: still unknown
  {version: 28}: tested!
N4:
  d: 5.000
  <<: [*someanchor1, *someanchor2]
Then this Python (3) program:
import sys
from pathlib import Path
import ruamel.yaml
yaml_in = Path('input.yaml')
yaml_out = Path('output.yaml')
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
# uncomment next line if your YAML is the outdated version 1.1 YAML but has no tag
# yaml.version = (1, 1)
data = yaml.load(yaml_in)
# do your updating here
data['Foo'].value = 'hello world!' # see the first of the notes
data['N1'][0]['parm3'] = 4444
data['N1'][0].insert(1, 'parm2', 222)
data['N1'][1][1] = 9999
data['N3'][('the', 'answer')] = 42
# and dump to file
yaml.dump(data, yaml_out)
creates output.yaml:
Foo: !Bar hello world!
N1:
- !mytaggedmaptype
parm1: 3
parm2: 222
parm3: 4444
- !mytaggedseqtype
- 8
- 9999
N2: &someanchor1
a: "some stuff"
b: 0.2e+1
f: |
within a literal scalar newlines
are preserved
N3: &someanchor2
c: 0x3
b: 4 # this value is not taken, as the first entry found is taken
['the', 'answer']: 42
{version: 28}: tested!
N4:
d: 5.000
<<: [*someanchor1, *someanchor2]
Please note:
you can update tagged scalars while preserving the tag on the scalar, but since you replace such a scalar by assignment (instead of updating a value in place, as with lists (sequences/arrays) or dicts (mappings/hashes)), you cannot just assign the new value or you'll lose the tagging information; you have to update the .value attribute.
things like anchors, merges, comments, and quotes are preserved, as are special forms of integers (hex, octal, etc.) and floats.
for YAML sequences that are mapping keys, you need to use a tuple (('the', 'answer')) instead of a sequence (['the', 'answer']), as Python doesn't allow mutable keys in mappings. And for YAML mappings that are mapping keys you would need to use the immutable Mapping from collections.abc. (I am not sure if Psych supports these kinds of valid YAML keys.)
See this if you need to update anchored/aliased scalars

Numbered list as YAML array

Instead of
key:
- thisvalue
- thatvalue
- anothervalue
I would like to have
key:
  1. thisvalue
  2. thatvalue
  3. anothervalue
purely for human readability, with the same interpretation of {key: [thisvalue, thatvalue, anothervalue]}.
This doesn't seem to be part of basic YAML syntax, but is there a way to achieve it, perhaps using some of the advanced arcana that's possible in YAML?
(I realize that this can be approximated by writing the list as:
key:
- 1. thisvalue
- 2. thatvalue
- 3. anothervalue
but this is an ugly hack and I'd prefer a solution where the numbers have semantic purpose, rather than being just part of the value's text that then has to be parsed and removed.)
There is no way to do that in YAML. You can, however, use normal nesting of elements and then build an array/list/dictionary from them during parsing:
my_numbered_pseudo_list:
  1: a
  2: b
  3: c
  ...
  n: x
When you load the example from above you will get a dictionary with the key "my_numbered_pseudo_list" whose value is a dictionary containing all the nested pairs {1: "a", 2: "b", ..., "n": "x"} (note that plain numeric keys load as integers). Here is an example of how it looks:
import yaml

doc = '''
list:
  1: a
  2: b
  3: c
  4: d
'''

y = yaml.safe_load(doc)
result = []
for i in y['list']:
    result.append(y['list'].get(i))
print(result)
This will give you
['a', 'b', 'c', 'd']
If you want to make sure that the order is actually kept in the YAML file you have to do some sorting in order to get an ordered final list where the order described in the YAML file is kept.
I have also seen people use ordered hash calls on the resulting dictionary (here: "list") (such as in Ruby which I am not familiar with) so you might want to dig a little bit more.
IMPORTANT!
Read here and here. In short, to make sure you get a truly ordered list from your YAML, you have to sort the dictionary you have as a pseudo-list by key and then extract the values and append those to your final list.
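The sorting step described above can be sketched like this (assuming PyYAML; the integer keys sort numerically regardless of the order in the file):

```python
import yaml

# Keys deliberately out of order in the file
doc = yaml.safe_load("""
list:
  2: b
  1: a
  4: d
  3: c
""")

# Sort the pseudo-list by its integer keys, then keep only the values
ordered = [value for key, value in sorted(doc['list'].items())]
print(ordered)  # ['a', 'b', 'c', 'd']
```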
When using Python, in order to preserve the key order in YAML mappings (and comments, anchor names, etc.), the mappings are read into special ordereddict derivatives if you use ruamel.yaml (disclaimer: I am the author) and its RoundTripLoader.
Those function transparently as dicts, and with that, using the syntax proposed by rbaleksandar in his/her answer, you can just do:
import ruamel.yaml as yaml
yaml_str = """\
key:
1: thisvalue
2: thatvalue
3: anothervalue
4: abc
5: def
6: ghi
"""
data = yaml.load(yaml_str, Loader=yaml.RoundTripLoader)
y = data['key']
print y.keys()[2:5]
print y.values()[2:5]
print y.items()[2:5]
to get:
[3, 4, 5]
['anothervalue', 'abc', 'def']
[(3, 'anothervalue'), (4, 'abc'), (5, 'def')]
without any special effort after loading the data.
The YAML specs state that key ordering is not guaranteed, but in the YAML file the keys are of course ordered. If the parser doesn't throw this information away, things are much more useful, e.g. for comparison between revisions of a file.
