pyyaml parse data with tag

I have YAML data like the input below and I need the output as key-value pairs.
Input:
a="""
--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess
code:
- '716'
- '718'
id:
- 488
- 499
"""
Output needed:
{'code': ['716', '718'], 'id': [488, 499]}
The default loading was giving me an error. I tried adding a new constructor, and now it no longer errors, but I am not able to get key-value pairs. FYI, if I remove the !ruby/hash:ActiveSupport::HashWithIndifferentAccess line from my YAML, then it gives me the desired output.
def new_constructor(loader, tag_suffix, node):
    if type(node.value) == 'list':
        val = ''.join(node.value)
    else:
        val = node.value
    val = node.value
    ret_val = """
{0}
""".format(val)
    return ret_val

yaml.add_multi_constructor('', new_constructor)
yaml.load(a)
Output:
"\n [(ScalarNode(tag=u'tag:yaml.org,2002:str', value=u'code'), SequenceNode(tag=u'tag:yaml.org,2002:seq', value=[ScalarNode(tag=u'tag:yaml.org,2002:str', value=u'716'), ScalarNode(tag=u'tag:yaml.org,2002:str', value=u'718')])), (ScalarNode(tag=u'tag:yaml.org,2002:str', value=u'id'), SequenceNode(tag=u'tag:yaml.org,2002:seq', value=[ScalarNode(tag=u'tag:yaml.org,2002:int', value=u'488'), ScalarNode(tag=u'tag:yaml.org,2002:int', value=u'499')]))]\n "
Please suggest.

This is not a solution using PyYAML, but I recommend using ruamel.yaml instead, if for no other reason than that it's more actively maintained than PyYAML. A quote from the overview:
Many of the bugs filed against PyYAML, but that were never acted upon, have been fixed in ruamel.yaml
To load that string, you can do
import ruamel.yaml
parser = ruamel.yaml.YAML()
obj = parser.load(a) # as defined above.
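Since ruamel.yaml's default (round-trip) mode preserves the unknown tag but otherwise gives you a normal mapping, you can use the result directly. A quick check, continuing from the snippet above (the expected values are taken from the question):
print(obj['code'])  # expected: ['716', '718']
print(obj['id'])    # expected: [488, 499]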

I strongly recommend following @Andrew F's answer, but in case you wonder why your code did not get the proper result: you don't correctly process the node under the tag in your tag handler. Although the node's value is a list (of tuples with key-value pairs), you should test the type of the node itself (using isinstance) and then hand it over to the "normal" mapping processing routine, as the tag is on a mapping:
import yaml
from yaml.loader import SafeLoader

a = """\
--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess
code:
- '716'
- '718'
id:
- 488
- 499
"""

def new_constructor(loader, tag_suffix, node):
    if isinstance(node, yaml.nodes.MappingNode):
        return loader.construct_mapping(node, deep=True)
    raise NotImplementedError

yaml.add_multi_constructor('', new_constructor, Loader=SafeLoader)

data = yaml.load(a, Loader=SafeLoader)
print(data)
which gives:
{'code': ['716', '718'], 'id': [488, 499]}
You should not use PyYAML's plain yaml.load(): it is documented to be potentially unsafe, and above all it is not necessary here. Just add the new constructor to the SafeLoader.
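A small variation on the above, in case you want to keep the shared SafeLoader pristine: register the multi-constructor on a dedicated subclass and pass that as the Loader. This is only a sketch; RubyHashLoader is a name made up for the example:
import yaml
from yaml.loader import SafeLoader

class RubyHashLoader(SafeLoader):
    """Hypothetical dedicated loader; the shared SafeLoader stays untouched."""

def new_constructor(loader, tag_suffix, node):
    if isinstance(node, yaml.nodes.MappingNode):
        return loader.construct_mapping(node, deep=True)
    raise NotImplementedError

# the constructor is registered only on the subclass
yaml.add_multi_constructor('', new_constructor, Loader=RubyHashLoader)

data = yaml.load(a, Loader=RubyHashLoader)  # `a` as defined in the question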

Related

Writing to a YAML file

I am using Python to write to a YAML file, but I am unable to produce the specific format I require:
awesome_people:
  name: Awesome People
  entities:
    - device_tracker.dad_smith
    - device_tracker.mom_smith
The problem is in the entities part: I am unable to create a list with the proper indents as in the above YAML. How can I create this exact format?
You can do this with ruamel.yaml by setting the sequence indent from the default two to four, and the offset of the dash+space (which needs a minimum indent of two) from zero to two. You should leave the mapping indent unchanged from the default:
import sys
import ruamel.yaml

ent = ['device_tracker.dad_smith', 'device_tracker.mom_smith']
data = dict(awesome_people=dict(name='Awesome People', entities=ent))

yaml = ruamel.yaml.YAML()
yaml.indent(sequence=4, offset=2)
yaml.dump(data, sys.stdout)
which gives:
awesome_people:
  name: Awesome People
  entities:
    - device_tracker.dad_smith
    - device_tracker.mom_smith
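If you want the result in a file rather than on stdout, YAML().dump() also accepts an open stream or a pathlib.Path. A minimal variation, with awesome_people.yaml as a made-up file name:
from pathlib import Path
import ruamel.yaml

ent = ['device_tracker.dad_smith', 'device_tracker.mom_smith']
data = dict(awesome_people=dict(name='Awesome People', entities=ent))

yaml = ruamel.yaml.YAML()
yaml.indent(sequence=4, offset=2)
yaml.dump(data, Path('awesome_people.yaml'))  # same block-style output, written to the file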

How to use Chunk Size for kedro.extras.datasets.pandas.SQLTableDataSet in the kedro pipeline?

I am using kedro.extras.datasets.pandas.SQLTableDataSet and would like to use the chunk_size argument from pandas. However, when running the pipeline, the table gets treated as a generator instead of a pd.DataFrame.
How would you use the chunk_size within the pipeline?
My catalog:
table_name:
  type: pandas.SQLTableDataSet
  credentials: redshift
  table_name: rs_table_name
  layer: output
  save_args:
    if_exists: append
    schema: schema.name
    chunk_size: 1000
Looking at the latest pandas docs, the actual kwarg is chunksize, not chunk_size; see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html.
Since kedro only wraps your save_args and passes them on to pd.DataFrame.to_sql, the names need to match, so rename chunk_size to chunksize in your catalog:
def _save(self, data: pd.DataFrame) -> None:
    try:
        data.to_sql(**self._save_args)
    except ImportError as import_error:
        raise _get_missing_module_error(import_error) from import_error
    except NoSuchModuleError as exc:
        raise _get_sql_alchemy_missing_error() from exc
EDIT: Once you have this working in your pipeline, the docs show that pandas.read_sql with chunksize set returns an Iterator[DataFrame]. This means that in your node function you should iterate over the input (and annotate accordingly, if appropriate), such as:
def my_node_func(input_dfs: Iterator[pd.DataFrame], *args):
    for df in input_dfs:
        ...
This works for the latest version of pandas. I have noticed, however, that pandas is aligning the API so that read_csv with chunksize set returns a ContextManager from pandas>=1.2, so I would expect this change to occur in read_sql as well.
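To see the generator behaviour in isolation, outside of kedro, here is a minimal standalone sketch (the sqlite URL and table name are placeholders, not from the question):
from typing import Iterator

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///example.db")  # placeholder connection

# with chunksize set, read_sql returns an iterator of DataFrames
# instead of one big DataFrame
chunks: Iterator[pd.DataFrame] = pd.read_sql(
    "SELECT * FROM some_table", engine, chunksize=1000
)
for df in chunks:
    print(len(df))  # each chunk holds up to 1000 rows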

Validate Python TypedDict at runtime

I'm working in a Python 3.8+ Django/Rest-Framework environment enforcing types in new code but built on a lot of untyped legacy code and data. We are using TypedDicts extensively for ensuring that data we are generating passes to our TypeScript front-end with the proper data type.
MyPy/PyCharm/etc. do a great job of checking that our new code spits out data that conforms, but we want to test that the output of our many RestSerializers/ModelSerializers fits the TypedDict. If I have a serializer and a TypedDict like:
class PersonSerializer(ModelSerializer):
    class Meta:
        model = Person
        fields = ['first', 'last']

class PersonData(TypedDict):
    first: str
    last: str
    email: str
and then run code like:
person_dict: PersonData = PersonSerializer(Person.objects.first()).data
Static type checkers aren't able to figure out that person_dict is missing the required email key, because (by design of PEP 589) it is just a normal dict. But I can write something like:
annotations = PersonData.__annotations__
for k in annotations:
    assert k in person_dict  # or something more complex.
    assert isinstance(person_dict[k], annotations[k])
and it will find that email is missing from the serializer's data. This is all well and good in this case, where I don't have any changes introduced by from __future__ import annotations (not sure if this would break it) and all my type annotations are bare types. But if PersonData were defined like:
class PersonData(TypedDict):
    email: Optional[str]
    affiliations: Union[List[str], Dict[int, str]]
then isinstance is not good enough to check if the data passes (since "Subscripted generics cannot be used with class and instance checks").
What I'm wondering is if there already exists a callable function/method (in mypy or another checker) that would allow me to validate a TypedDict (or even a single variable, since I can iterate a dict myself) against an annotation and see if it validates?
I'm not concerned about speed, etc., since the point of this is to check all our data/methods/functions once and then remove the checks later once we're happy that our current data validates.
The simplest solution I found works using pydantic.
from typing import cast, TypedDict

import pydantic

class SomeDict(TypedDict):
    val: int
    name: str

# this could be a valid or invalid declaration
obj: SomeDict = {
    'val': 12,
    'name': 'John',
}

# validate with pydantic
try:
    obj = cast(SomeDict, pydantic.create_model_from_typeddict(SomeDict)(**obj).dict())
except pydantic.ValidationError as exc:
    print(f"ERROR: Invalid schema: {exc}")
EDIT: When type checking this, it currently returns an error, but works as expected. See here: https://github.com/samuelcolvin/pydantic/issues/3008
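Note that create_model_from_typeddict is a pydantic v1 API. If you are on pydantic v2, the closest equivalent I'm aware of is TypeAdapter:
from typing import TypedDict

import pydantic  # assuming pydantic v2

class SomeDict(TypedDict):
    val: int
    name: str

adapter = pydantic.TypeAdapter(SomeDict)
adapter.validate_python({'val': 12, 'name': 'John'})      # passes
adapter.validate_python({'val': 'nope', 'name': 'John'})  # raises pydantic.ValidationError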
You may want to have a look at https://pypi.org/project/strongtyping/. This may help.
In the docs you can find this example:
from typing import List, TypedDict

from strongtyping.strong_typing import match_class_typing

@match_class_typing
class SalesSummary(TypedDict):
    sales: int
    country: str
    product_codes: List[str]

# works as expected
SalesSummary({"sales": 10, "country": "Foo", "product_codes": ["1", "2", "3"]})

# will raise a TypeMisMatch
SalesSummary({"sales": "Foo", "country": 10, "product_codes": [1, 2, 3]})
A little bit of a hack, but you can check two types using mypy's command line -c option. Just wrap it in a Python function:
import subprocess

def is_assignable(type_to, type_from) -> bool:
    """
    Returns True if `type_from` can be assigned to `type_to`,
    e.g. type_to := type_from

    Example:
    >>> is_assignable(bool, str)
    False
    >>> from typing import *
    >>> is_assignable(Union[List[str], Dict[int, str]], List[str])
    True
    """
    code = "\n".join((
        "import typing",
        f"type_to: {type_to}",
        f"type_from: {type_from}",
        "type_to = type_from",
    ))
    return subprocess.call(("mypy", "-c", code)) == 0
You could do something like this:
def validate(typ: Any, instance: Any) -> bool:
    for property_name, property_type in typ.__annotations__.items():
        value = instance.get(property_name, None)
        if value is None:
            # Check for missing keys
            print(f"Missing key: {property_name}")
            return False
        elif property_type not in (int, float, bool, str):
            # check if property_type is an object (e.g. not a primitive)
            result = validate(property_type, value)
            if result is False:
                return False
        elif not isinstance(value, property_type):
            # Check for type equality
            print(f"Wrong type: {property_name}. Expected {property_type}, got {type(value)}")
            return False
    return True
And then test some object, e.g. one that was passed to your REST endpoint:
class MySubModel(TypedDict):
    subfield: bool

class MyModel(TypedDict):
    first: str
    last: str
    email: str
    sub: MySubModel

m = {
    'email': 'JohnDoeAtDoeishDotCom',
    'first': 'John'
}

assert validate(MyModel, m) is False
This one prints the first error and returns a bool; you could change that to raise exceptions, possibly with all the missing keys. You could also extend it to fail on keys not defined by the model.
I like your solution! So that the user doesn't have to fix and re-run one error at a time, I added some code to your solution that collects all errors before raising :D
def validate_custom_typed_dict(instance: Any, custom_typed_dict: TypedDict) -> bool:
    key_errors = []
    type_errors = []
    for property_name, type_ in custom_typed_dict.__annotations__.items():
        value = instance.get(property_name, None)
        if value is None:
            # Check for missing keys
            key_errors.append(f"\t- Missing property: '{property_name}' \n")
        elif type_ not in (int, float, bool, str):
            # check if type is an object (e.g. not a primitive)
            result = validate_custom_typed_dict(value, type_)
            if result is False:
                type_errors.append(f"\t- '{property_name}' expected {type_}, got {type(value)}\n")
        elif not isinstance(value, type_):
            # Check for type equality
            type_errors.append(f"\t- '{property_name}' expected {type_}, got {type(value)}\n")
    if len(key_errors) > 0 or len(type_errors) > 0:
        error_message = f'\n{"".join(key_errors)}{"".join(type_errors)}'
        raise Exception(error_message)
    return True
some console output:
Exception:
    - Missing property: 'Combined_cycle'
    - Missing property: 'Solar_PV'
    - Missing property: 'Hydro'
    - 'timestamp' expected <class 'str'>, got <class 'int'>
    - 'Diesel_engines' expected <class 'float'>, got <class 'int'>
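For reference, a call that produces output in that shape could look like this (reusing MyModel and the test dict m from the answer above):
m = {'email': 'JohnDoeAtDoeishDotCom', 'first': 'John'}
try:
    validate_custom_typed_dict(m, MyModel)
except Exception as exc:
    print(f"Exception: {exc}")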

Remove a field from YAML OpenAPI specification

I need to remove the tags field from each of the methods in my OpenAPI spec.
The spec must be in YAML format, as converting to JSON causes issues later on when publishing.
I couldn't find a ready tool for that, and my programming skills are insufficient. I tried Python with ruamel.yaml, but could not achieve anything.
I'm open to any suggestions how to approach this - a repo with a ready tool somewhere, a hint on what to try in Python... I'm out of my own ideas.
Maybe a regex that catches all instances of tags, so I can do a search and replace with Python, replacing them with nothing? Empty lines don't seem to break the publishing engine.
Here's a sample YAML piece (I know this is not a proper spec, I just want to show where tags sits in the YAML):
openapi: 3.0.0
info:
  title: ""
  description: ""
paths:
  /endpoint:
    get:
      tags:
        - tag1
        - tag3
      #rest of GET definition
    post:
      tags:
        - tag2
  /anotherEndpoint:
    post:
      tags:
        - tag1
I need to get rid of all tags arrays entirely (not just make them empty)
I am not sure why you couldn't achieve anything with Python + ruamel.yaml. Assuming your spec is in a file input.yaml:
from pathlib import Path
import ruamel.yaml

in_file = Path('input.yaml')
out_file = Path('output.yaml')

yaml = ruamel.yaml.YAML()
yaml.indent(mapping=4, sequence=4, offset=2)
yaml.preserve_quotes = True
data = yaml.load(in_file)

# if you only have the three instances of 'tags', you can hard-code them:
# del data['paths']['/endpoint']['get']['tags']
# del data['paths']['/endpoint']['post']['tags']
# del data['paths']['/anotherEndpoint']['post']['tags']

# generic recursive removal of any key named 'tags' in the data structure:
def rm_tags(d):
    if isinstance(d, dict):
        if 'tags' in d:
            del d['tags']
        for k in d:
            rm_tags(d[k])
    elif isinstance(d, list):
        for item in d:
            rm_tags(item)

rm_tags(data)
yaml.dump(data, out_file)
which gives as output.yaml:
openapi: 3.0.0
info:
    title: ""
    description: ""
paths:
    /endpoint:
        get: {}
        post: {}
    /anotherEndpoint:
        post: {}
You can write data back to input.yaml if you need that. Please note that normally the comment #rest of GET definition would be preserved, but not in this case, as it is associated during loading with the key before it, and that key gets deleted.
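Writing back in place is then just a matter of dumping to the same path, reusing the yaml instance and in_file from the script above:
# overwrite input.yaml with the 'tags'-stripped data
yaml.dump(data, in_file)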

Prettify YAML with comments

1. Summary
I can't find how I can automatically prettify my YAML files.
2. Data
Example:
I have a SashaPrettifyYAML.yaml file:
sasha_commands:
  # Sasha comment
  sasha_command_help: {call: sublime.command_help, caption: 'Sasha Command: Command Help'}
3. Expected behavior
I want to delete the {braces}:
sasha_commands:
  # Sasha comment
  sasha_command_help:
    call: sublime.command_help
    caption: 'Sasha Command: Command Help'
4. Not helped
Pretty YAML (based on PyYAML) and online formatters such as YAML Formatter and OnlineYAMLTools delete comments;
I can't find the required option in ruamel.yaml.cmd;
align-yaml aligns, but does not prettify, the YAML file.
There is no option to do this in ruamel.yaml.cmd, but it is fairly straightforward to do with a small Python program using ruamel.yaml, loading and dumping in round-trip mode (the default). The only thing you need to do is make sure the flow style of the data structure that is the value for the key sasha_command_help is set to block style (which is how I interpret your definition of "prettifying YAML"):
import sys
import ruamel.yaml

yaml_str = """\
sasha_commands:
  # Sasha comment
  sasha_command_help: {call: sublime.command_help, caption: 'Sasha Command: Command Help'}
"""

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load(yaml_str)
data['sasha_commands']['sasha_command_help'].fa.set_block_style()
yaml.dump(data, sys.stdout)
This will give exactly the output you expect. A recursive data structure walker can be found in scalarstring.py in the ruamel.yaml source; it can be adapted into a generic "make everything block style" routine:
import sys
import ruamel.yaml

def block_style(base):
    """
    This routine walks over a simple tree (i.e. consisting of dicts, lists
    and primitives) loaded from YAML. It recurses into dict values and list
    items, and sets block style on these.
    """
    if isinstance(base, dict):
        for k in base:
            try:
                base.fa.set_block_style()
            except AttributeError:
                pass
            block_style(base[k])
    elif isinstance(base, list):
        for elem in base:
            try:
                base.fa.set_block_style()
            except AttributeError:
                pass
            block_style(elem)

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
file_in = sys.argv[1]
file_out = sys.argv[2]
with open(file_in) as fp:
    data = yaml.load(fp)
block_style(data)
with open(file_out, 'w') as fp:
    yaml.dump(data, fp)
If you store the above in prettifyyaml.py you can call it with:
python prettifyyaml.py SashaPrettifyYAML.yaml Prettified.yaml
Since you are already using single quotes around the scalar that has embedded spaces, you won't see a change if you leave out yaml.preserve_quotes = True. But if you had used a double quoted scalar then that line makes sure the double quotes are preserved.
I had the same problem, so I wrote my own YAML beautifier: https://github.com/wangkuiyi/yamlfmt. I hope it helps.
I tried the top results from Google, but none of them address the requirements of https://sqlflow.org/sqlflow, which I am leading:
https://pypi.org/project/yamlfmt cannot handle a file of multiple YAML documents separated by ---;
https://github.com/devopyio/yamlfmt cannot handle multiple files;
https://github.com/miekg/yamlfmt/blob/master/fmt.go cannot replace (inline edit) the input files.
You can use the yq tool: it's easy to install and use, and it's well maintained.
Supposing you have an example.yml file to format, it can be processed in the following ways:
from a file: yq r --unwrapScalar -p pv -P example.yml '*'
from stdin: cat example.yml | yq r --unwrapScalar -p pv -P - '*'
