Why can't we convert flat columns of awkward1 arrays `to_parquet`?

A follow-up to this question: Best way to save a dict of awkward1 arrays?
To save multiple columns of nested awkward1 arrays (with varying lengths):
import awkward1 as ak
import numpy as np
dog = ak.from_iter([[1, 2], [5]])
cat = ak.from_iter([[4]])
pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)
ak.to_parquet(pets, "pets.parquet")
Unfortunately, this doesn't seem to work for flat lists:
import awkward1 as ak
import numpy as np
dog = ak.from_iter([1, 2, 5])
cat = ak.from_iter([4])
pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)
ak.to_parquet(pets, "pets.parquet")
which produces the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-31-7f3a7fefb261> in <module>
3 cat = ak.from_iter([3])
4 pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)
----> 5 ak.to_parquet(pets, "pets.parquet")
~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/awkward/operations/convert.py in to_parquet(array, where, explode_records, list_to32, string_to32, bytestring_to32, **options)
2983 layout = to_layout(array, allow_record=False, allow_other=False)
2984 iterator = batch_iterator(layout)
-> 2985 first = next(iterator)
2986
2987 if "schema" not in options:
~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/awkward/operations/convert.py in batch_iterator(layout)
2978 )
2979 yield pyarrow.RecordBatch.from_arrays(
-> 2980 pa_arrays, schema=pyarrow.schema(pa_fields)
2981 )
2982
~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_arrays()
TypeError: object of type 'pyarrow.lib.Tensor' has no len()
What is the reason for encountering this error?

What you found is a bug, and now it is fixed: https://github.com/scikit-hep/awkward-1.0/pull/799
What's happening here is that pyarrow can't write pyarrow.lib.Tensor (regular-length lists, such as the one you created with np.newaxis) to Parquet files. Parquet files don't have a concept of "regular-length list," so that makes sense. But rather than converting it, pyarrow hits an unhandled case, in which it fails to find the length of that pyarrow.lib.Tensor. (It's a little odd that pyarrow.lib.Tensor doesn't have a __len__ method, but that's another thing.)
Anyway, with version 1.2.0 of Awkward Array, we'll simply convert regular-length lists into (in principle) variable-length lists when writing to Parquet, since the format doesn't have that type. According to the schedule, version 1.2.0 will be released tomorrow. (This bug-fix is likely the last prerelease.)
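If you're on a version before that fix, one possible workaround (a sketch of my own, not from the original answer) is to build the length-1 outer dimension as a variable-length list instead of using np.newaxis, so pyarrow never sees a regular-length (Tensor) type:
import awkward1 as ak
# Hypothetical workaround: wrap the flat data in one variable-length list
# (ak.from_iter) rather than using dog[np.newaxis], which creates a regular-length list.
dog = ak.from_iter([[1, 2, 5]])
cat = ak.from_iter([[4]])
pets = ak.zip({"dog": dog, "cat": cat}, depth_limit=1)
ak.to_parquet(pets, "pets.parquet")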

Related

With ruamel.yaml how can I conditionally convert flow maps to block maps based on line length?

I'm working on a ruamel.yaml (v0.17.4) based YAML reformatter (using the RoundTrip variant to preserve comments).
I want to allow a mix of block- and flow-style maps, but in some cases, I want to convert a flow-style map to use block-style.
In particular, if the flow-style map would be longer than the max line length^, I want to convert that to a block-style map instead of wrapping the line somewhere in the middle of the flow-style map.
^ By "max line length" I mean the best_width that I configure by setting something like yaml.width = 120 where yaml is a ruamel.yaml.YAML instance.
What should I extend to achieve this? The emitter is where the line length gets calculated so wrapping can occur, but I suspect that is too late to convert between block and flow style. I'm also concerned about losing comments when I switch styles. Here are some possible extension points; can you give me a pointer on where I'm most likely to have success with this?
Emitter.expect_flow_mapping() probably too late for converting flow->block
Serializer.serialize_node() probably too late as it consults node.flow_style
RoundTripRepresenter.represent_mapping() maybe? but this has no idea about line length
I could also walk the data before calling yaml.dump(), but this has no idea about line length.
So, where should I (and where can I) adjust the flow_style depending on whether a flow-style map would trigger line wrapping?
I think the most accurate approach, when you encounter a flow-style mapping during the dumping process, is to first try to emit it to a buffer, get the length of that buffer, and, if that combined with the column you are at is too long, actually emit block style.
Any attempt to guesstimate the length of the output without actually trying to write that part of the tree is going to be hard, if not impossible, without doing the actual emit. Among other things, the dumping process actually dumps scalars and reads them back to make sure no quoting needs to be forced (e.g. when you dump a string that reads back like a date). It also handles single key-value pairs in a list in a special way ([1, a: 42, 3] instead of the more verbose [1, {a: 42}, 3]). So a simple calculation of the length of the scalars that are the keys and values, plus the separating commas, colons and spaces, is not going to be precise.
A different approach is to dump your data with a large line width and parse the output and make a set of line numbers for which the line is too long according to the width that you actually want to use. After loading that output back you can walk over the data structure recursively, inspect the .lc attribute to determine the line number on which a flow style mapping (or sequence) started and if that line number is in the set you built beforehand change the mapping to block style. If you have nested flow-style collections, you might have to repeat this process.
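As a minimal sketch of the two attributes that approach relies on (this small example is mine, not part of the original answer):
import ruamel.yaml

yaml = ruamel.yaml.YAML()            # round-trip loader/dumper
data = yaml.load("a: {b: 1, c: 2}\n")
print(data['a'].fa.flow_style())     # True: 'a' was loaded as a flow-style mapping
print(data['a'].lc.line)             # 0: the line number on which that mapping started
data['a'].fa.set_block_style()       # it would now be dumped in block style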
If you run the following, the initial dumped value for quote will be on one line. The change_to_block method as presented changes all mappings/sequences that are on a line that is too long.
import sys
import ruamel.yaml

yaml_str = """\
movie: bladerunner
quote: {[Batty, Roy]: [
    I have seen things you people wouldn't believe.,
    Attack ships on fire off the shoulder of Orion.,
    I watched C-beams glitter in the dark near the Tannhäuser Gate.,
  ]}
"""

class Blockify:
    def __init__(self, width, only_first=False, verbose=0):
        self._width = width
        self._yaml = None
        self._only_first = only_first
        self._verbose = verbose

    @property
    def yaml(self):
        if self._yaml is None:
            self._yaml = y = ruamel.yaml.YAML(typ=['rt', 'string'])
            y.preserve_quotes = True
            y.width = 2**16
        return self._yaml

    def __call__(self, d):
        pass_nr = 0
        changed = [True]
        while changed[0]:
            changed[0] = False
            try:
                s = self.yaml.dumps(d)
            except AttributeError:
                print("use 'pip install ruamel.yaml.string' to install plugin that gives 'dumps' to string")
                sys.exit(1)
            if self._verbose > 1:
                print(s)
            too_long = set()
            max_ll = -1
            for line_nr, line in enumerate(s.splitlines()):
                if len(line) > self._width:
                    too_long.add(line_nr)
                if len(line) > max_ll:
                    max_ll = len(line)
            if self._verbose > 0:
                print(f'pass: {pass_nr}, lines: {sorted(too_long)}, longest: {max_ll}')
                sys.stdout.flush()
            new_d = self.yaml.load(s)
            self.change_to_block(new_d, too_long, changed, only_first=self._only_first)
            d = new_d
            pass_nr += 1
        return d, s

    @staticmethod
    def change_to_block(d, too_long, changed, only_first):
        if isinstance(d, dict):
            if d.fa.flow_style() and d.lc.line in too_long:
                d.fa.set_block_style()
                changed[0] = True
                return  # don't convert nested flow styles, might not be necessary
            # don't change keys if any value is changed
            for v in d.values():
                Blockify.change_to_block(v, too_long, changed, only_first)
                if only_first and changed[0]:
                    return
            if changed[0]:  # don't change keys if value has changed
                return
            for k in d:
                Blockify.change_to_block(k, too_long, changed, only_first)
                if only_first and changed[0]:
                    return
        if isinstance(d, (list, tuple)):
            if d.fa.flow_style() and d.lc.line in too_long:
                d.fa.set_block_style()
                changed[0] = True
                return  # don't convert nested flow styles, might not be necessary
            for elem in d:
                Blockify.change_to_block(elem, too_long, changed, only_first)
                if only_first and changed[0]:
                    return

blockify = Blockify(96, verbose=2)  # set verbose to 0, to suppress progress output

yaml = ruamel.yaml.YAML(typ=['rt', 'string'])
data = yaml.load(yaml_str)
blockified_data, string_output = blockify(data)
print('-'*32, 'result:', '-'*32)
print(string_output)  # string_output has no final newline
which gives:
movie: bladerunner
quote: {[Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]}
pass: 0, lines: [1], longest: 186
movie: bladerunner
quote:
  [Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]
pass: 1, lines: [2], longest: 179
movie: bladerunner
quote:
  [Batty, Roy]:
  - I have seen things you people wouldn't believe.
  - Attack ships on fire off the shoulder of Orion.
  - I watched C-beams glitter in the dark near the Tannhäuser Gate.
pass: 2, lines: [], longest: 67
-------------------------------- result: --------------------------------
movie: bladerunner
quote:
  [Batty, Roy]:
  - I have seen things you people wouldn't believe.
  - Attack ships on fire off the shoulder of Orion.
  - I watched C-beams glitter in the dark near the Tannhäuser Gate.
Please note that when using ruamel.yaml<0.18, the sequence [Batty, Roy] will never be in block style, because the tuple subclass CommentedKeySeq never gets a line number attached.

Huggingface load_dataset() method how to assign the "features" argument?

I'm trying to load a custom dataset to use for fine-tuning a Huggingface model. My data is a CSV file with two columns: one is 'sequence', which is a string; the other is 'label', which is also a string, with 8 classes. I want to load my dataset and assign the type of the 'sequence' column to 'string' and the type of the 'label' column to 'ClassLabel'.
My code is this:
from datasets import Features
from datasets import load_dataset
ft = Features({'sequence':'str','label':'ClassLabel'})
mydataset = load_dataset("csv", data_files="mydata.csv",features= ft)
Running this code, I got the following error:
TypeError Traceback (most recent call last)
<ipython-input-59-45fedff522e8> in <module>()
7
8 mydataset = load_dataset("csv", data_files="mydata.csv",
----> 9 features= ft)
10
8 frames
/usr/local/lib/python3.7/dist-packages/datasets/features/features.py in get_nested_type(schema)
794
795 # Other objects are callable which returns their data type (ClassLabel, Array2D, Translation, Arrow datatype creation methods)
--> 796 return schema()
TypeError: 'str' object is not callable
Could someone help, please?
You should define your Features similarly to what is shown here:
from datasets import Features, Value, ClassLabel
from datasets import load_dataset
class_names = ['class_label_1', 'class_label_2']
ft = Features({'sequence': Value('string'), 'label': ClassLabel(names=class_names)})
mydataset = load_dataset("csv", data_files="mydata.csv",features=ft)
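As a quick sanity check (a sketch of my own, assuming mydata.csv has a header row with 'sequence' and 'label' columns whose values match class_names), you can inspect the resulting split:
# The csv builder puts everything into a 'train' split by default.
print(mydataset["train"].features)  # shows Value('string') and the ClassLabel with your names
print(mydataset["train"][0])        # if the cast succeeded, 'label' is stored as an integer class index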

YAML mapping order not preserved when using alias and yamlordereddictloader loader

I want to load a YAML file into Python as an OrderedDict. I am using yamlordereddictloader to preserve ordering.
However, I notice that the aliased object is placed "too soon" in the OrderedDict in the output.
How can I preserve the order of this mapping when read into Python, ideally as an OrderedDict? Is it possible to achieve this result without writing some custom parsing?
Notes:
I'm not particularly concerned with the method used, as long as the end result is the same.
Using sequences instead of mappings is problematic because they can result in nested output, and I can't simply flatten everything (some nestedness is appropriate).
When I try to just use !!omap, I cannot seem to merge the aliased mapping (d1.dt) into the d2 mapping.
I'm on Python 3.6; if I don't use this loader or !!omap, order is not preserved (apparently contrary to the top 'Update' here: https://stackoverflow.com/a/21912744/2343633).
import yaml
import yamlordereddictloader

yaml_file = """
d1:
  id:
    nm1: val1
  dt: &dt
    nm2: val2
    nm3: val3
d2: # expect nm4, nm2, nm3
  nm4: val4
  <<: *dt
"""

out = yaml.load(yaml_file, Loader=yamlordereddictloader.Loader)
keys = [x for x in out['d2']]
print(keys)  # ['nm2', 'nm3', 'nm4']
assert keys == ['nm4', 'nm2', 'nm3'], "order from YAML file is not preserved, aliased keys placed too early"
Is it possible to achieve this result without writing some custom parsing?
Yes. You need to override the method flatten_mapping from SafeConstructor. Here's a basic working example:
import yaml
import yamlordereddictloader
from yaml.constructor import *
from yaml.reader import *
from yaml.parser import *
from yaml.resolver import *
from yaml.composer import *
from yaml.scanner import *
from yaml.nodes import *

class MyLoader(yamlordereddictloader.Loader):
    def __init__(self, stream):
        yamlordereddictloader.Loader.__init__(self, stream)

    # taken from here and reengineered to keep order:
    # https://github.com/yaml/pyyaml/blob/5.3.1/lib/yaml/constructor.py#L207
    def flatten_mapping(self, node):
        merged = []

        def merge_from(node):
            if not isinstance(node, MappingNode):
                raise ConstructorError("while constructing a mapping",
                    node.start_mark, "expected mapping for merging, but found %s" %
                    node.id, node.start_mark)
            self.flatten_mapping(node)
            merged.extend(node.value)

        for index in range(len(node.value)):
            key_node, value_node = node.value[index]
            if key_node.tag == u'tag:yaml.org,2002:merge':
                if isinstance(value_node, SequenceNode):
                    for subnode in value_node.value:
                        merge_from(subnode)
                else:
                    merge_from(value_node)
            else:
                if key_node.tag == u'tag:yaml.org,2002:value':
                    key_node.tag = u'tag:yaml.org,2002:str'
                merged.append((key_node, value_node))
        node.value = merged

yaml_file = """
d1:
  id:
    nm1: val1
  dt: &dt
    nm2: val2
    nm3: val3
d2: # expect nm4, nm2, nm3
  nm4: val4
  <<: *dt
"""

out = yaml.load(yaml_file, Loader=MyLoader)
keys = [x for x in out['d2']]
print(keys)
assert keys == ['nm4', 'nm2', 'nm3'], "order from YAML file is not preserved, aliased keys placed too early"
This does not have the best performance, as it basically copies all key-value pairs from all mappings once each during loading, but it works. Performance enhancement is left as an exercise for the reader :).

How to calculate shap values for ADABoost model?

I am running 3 different models (Random Forest, Gradient Boosting, AdaBoost) and a model ensemble based on these 3 models.
I managed to use SHAP for GB and RF, but not for AdaBoost, which fails with the following error:
Exception Traceback (most recent call last)
in engine
----> 1 explainer = shap.TreeExplainer(model,data = explain_data.head(1000), model_output= 'probability')
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_perturbation, **deprecated_options)
110 self.feature_perturbation = feature_perturbation
111 self.expected_value = None
--> 112 self.model = TreeEnsemble(model, self.data, self.data_missing)
113
114 if feature_perturbation not in feature_perturbation_codes:
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing)
752 self.tree_output = "probability"
753 else:
--> 754 raise Exception("Model type not yet supported by TreeExplainer: " + str(type(model)))
755
756 # build a dense numpy version of all the tree objects
Exception: Model type not yet supported by TreeExplainer: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
I found this link on GitHub that states:
TreeExplainer creates a TreeEnsemble object from whatever model type we are trying to explain, and then works with that downstream. So all you would need to do is add another if statement in the
TreeEnsemble constructor similar to the one for gradient boosting
But I really don't know how to implement it since I am quite new to this.
I had the same problem, and what I did was to modify the file in the GitHub repository you are referring to.
In my case I use Windows, so the file is in C:\Users\my_user\AppData\Local\Continuum\anaconda3\Lib\site-packages\shap\explainers, but you can also double-click on the error message and the file will be opened.
The next step is to add another elif, as the answer in the GitHub issue says. In my case I did it from line 404, as follows:
1) Modify the source code.
...
            self.objective = objective_name_map.get(model.criterion, None)
            self.tree_output = "probability"
        elif str(type(model)).endswith("sklearn.ensemble.weight_boosting.AdaBoostClassifier'>"):  # From this line I have modified the code
            scaling = 1.0 / len(model.estimators_)  # output is average of trees
            self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
            self.objective = objective_name_map.get(model.base_estimator_.criterion, None)  # This line is done to get the decision criteria, for example gini.
            self.tree_output = "probability"  # This is the last line I added
        elif str(type(model)).endswith("sklearn.ensemble.forest.ExtraTreesClassifier'>"):  # TODO: add unit test for this case
            scaling = 1.0 / len(model.estimators_)  # output is average of trees
            self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
...
Note that, as in the other models, shap's code needs the attribute 'criterion', which the AdaBoost classifier doesn't have directly. In this case the attribute is obtained from the "weak" classifiers with which AdaBoost has been trained; that's why I add model.base_estimator_.criterion.
Finally you have to import the library again, train your model and get the shap values. I leave an example:
2) Import again the library and try:
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
import shap
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
ADABoost_model = AdaBoostClassifier()
ADABoost_model.fit(X, y)
shap_values = shap.TreeExplainer(ADABoost_model).shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")
Which generates the following:
3) Get your new results:
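For reference, a small sketch of inspecting the raw output (my own addition; it assumes the iris example above and that this shap version returns one array per class for multiclass models):
# shap_values is expected to be a list with one (n_samples, n_features) array per class
print(len(shap_values))      # 3 classes in iris
print(shap_values[0].shape)  # (150, 4)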
It seems that the shap package has been updated but still does not contain the AdaBoostClassifier. Based on the previous answer, I've modified it to work with the shap/explainers/tree.py file at lines 598-610:
### Added AdaBoostClassifier based on the outdated StackOverflow response and Github issue here
### https://stackoverflow.com/questions/60433389/how-to-calculate-shap-values-for-adaboost-model/61108156#61108156
### https://github.com/slundberg/shap/issues/335
elif safe_isinstance(model, ["sklearn.ensemble.AdaBoostClassifier", "sklearn.ensemble._weight_boosting.AdaBoostClassifier"]):
    assert hasattr(model, "estimators_"), "Model has no `estimators_`! Have you called `model.fit`?"
    self.internal_dtype = model.estimators_[0].tree_.value.dtype.type
    self.input_dtype = np.float32
    scaling = 1.0 / len(model.estimators_)  # output is average of trees
    self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
    self.objective = objective_name_map.get(model.base_estimator_.criterion, None)  # This line is done to get the decision criteria, for example gini.
    self.tree_output = "probability"  # This is the last line added
I'm also working on tests so this can be added to the package :)

How to clear/overwrite all data in a shared memory?

I need to overwrite all previously written data in a shared memory (multiprocessing.shared_memory).
Here is the sample code:
from multiprocessing import shared_memory
import json
shared = shared_memory.SharedMemory(create=True, size=24, name='TEST')
data_one = {'ONE': 1, 'TWO': 2}
data_two = {'ACTIVE': 1}
_byte_data_one = bytes(json.dumps(data_one), encoding='ascii')
_byte_data_two = bytes(json.dumps(data_two), encoding='ascii')
# First write to shared memory
shared.buf[0:len(_byte_data_one)] = _byte_data_one
print(f'Data: {shared.buf.tobytes()}')
# Second write
shared.buf[0:len(_byte_data_two)] = _byte_data_two
print(f'Data: {shared.buf.tobytes()}')
shared.close()
shared.unlink()
Output:
First write: b'{"ONE": 1, "TWO": 2}\x00\x00\x00\x00'
Second write: b'{"ACTIVE": 1}WO": 2}\x00\x00\x00\x00'
The output is understandable, since the second write starts from index 0 and ends at _byte_data_two length. (shared.buf[0:len(_byte_data_two)] = _byte_data_two)
I need every new write to the shared memory to overwrite all previously written data.
I've tried shared.buf[0:] = b'' before every new write to shared memory but ended up getting
ValueError: memoryview assignment: lvalue and rvalue have different structures
Also I've tried shared.buf[0:len(_byte_data_two)] = b'' after every new write, with the same result.
The result I'm looking for:
First write: b'{"ONE": 1, "TWO": 2}\x00\x00\x00\x00'
Second write: b'{"ACTIVE": 1}\x00\x00\x00\x00' without the extra 'WO": 2}' from the first write
How to overwrite all previously written data in a shared memory?
The easiest might be to create a zero-filled byte array first, something like:
def set_zero_filled(sm, data):
    buf = bytearray(sm.nbytes)
    buf[:len(data)] = data
    sm.buf[:] = buf
which you can use as:
set_zero_filled(shared, json.dumps(data_two).encode())
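To read the value back afterwards, a small sketch (my addition, assuming the buffer was written with set_zero_filled so the unused tail is all zero bytes):
raw = bytes(shared.buf)                                   # copy out the current buffer contents
data = json.loads(raw.rstrip(b'\x00').decode('ascii'))    # strip the zero padding, then decode
print(data)  # {'ACTIVE': 1}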
