With ruamel.yaml how can I conditionally convert flow maps to block maps based on line length? - yaml

I'm working on a ruamel.yaml (v0.17.4) based YAML reformatter (using the RoundTrip variant to preserve comments).
I want to allow a mix of block- and flow-style maps, but in some cases, I want to convert a flow-style map to use block-style.
In particular, if the flow-style map would be longer than the max line length^, I want to convert that to a block-style map instead of wrapping the line somewhere in the middle of the flow-style map.
^ By "max line length" I mean the best_width that I configure by setting something like yaml.width = 120 where yaml is a ruamel.yaml.YAML instance.
What should I extend to achieve this? The emitter is where the line-length gets calculated so wrapping can occur, but I suspect that is too late to convert between block- and flow-style. I'm also concerned about losing comments when I switch the styles. Here are some possible extension points, can you give me a pointer on where I'm most likely to have success with this?
Emitter.expect_flow_mapping() probably too late for converting flow->block
Serializer.serialize_node() probably too late as it consults node.flow_style
RoundTripRepresenter.represent_mapping() maybe? but this has no idea about line length
I could also walk the data before calling yaml.dump(), but this has no idea about line length.
So, where should I and where can I adjust the flow_style whether a flow-style map would trigger line wrapping?

What I think the most accurate approach is when you encounter a flow-style mapping in the dumping process is to first try to emit it to a buffer and then get the length of the buffer and if that combined with the column that you are in, actually emit block-style.
Any attempt to guesstimate the length of the output without actually trying to write that part of a tree is going to be hard, if not impossible to do without doing the actual emit. Among other things the dumping process actually dumps scalars and reads them back to make sure no quoting needs to be forced (e.g. when you dump a string that reads back like a date). It also handles single key-value pairs in a list in a special way ( [1, a: 42, 3] instead of the more verbose [1, {a: 42}, 3]. So a simple calculation of the length of the scalars that are the keys and values and separating comma, colon and spaces is not going to be precise.
A different approach is to dump your data with a large line width and parse the output and make a set of line numbers for which the line is too long according to the width that you actually want to use. After loading that output back you can walk over the data structure recursively, inspect the .lc attribute to determine the line number on which a flow style mapping (or sequence) started and if that line number is in the set you built beforehand change the mapping to block style. If you have nested flow-style collections, you might have to repeat this process.
If you run the following, the initial dumped value for quote will be on one line.
The change_to_block method as presented changes all mappings/sequences that are too long
that are on one line.
import sys
import ruamel.yaml
yaml_str = """\
movie: bladerunner
quote: {[Batty, Roy]: [
I have seen things you people wouldn't believe.,
Attack ships on fire off the shoulder of Orion.,
I watched C-beams glitter in the dark near the Tannhäuser Gate.,
]}
"""
class Blockify:
def __init__(self, width, only_first=False, verbose=0):
self._width = width
self._yaml = None
self._only_first = only_first
self._verbose = verbose
#property
def yaml(self):
if self._yaml is None:
self._yaml = y = ruamel.yaml.YAML(typ=['rt', 'string'])
y.preserve_quotes = True
y.width = 2**16
return self._yaml
def __call__(self, d):
pass_nr = 0
changed = [True]
while changed[0]:
changed[0] = False
try:
s = self.yaml.dumps(d)
except AttributeError:
print("use 'pip install ruamel.yaml.string' to install plugin that gives 'dumps' to string")
sys.exit(1)
if self._verbose > 1:
print(s)
too_long = set()
max_ll = -1
for line_nr, line in enumerate(s.splitlines()):
if len(line) > self._width:
too_long.add(line_nr)
if len(line) > max_ll:
max_ll = len(line)
if self._verbose > 0:
print(f'pass: {pass_nr}, lines: {sorted(too_long)}, longest: {max_ll}')
sys.stdout.flush()
new_d = self.yaml.load(s)
self.change_to_block(new_d, too_long, changed, only_first=self._only_first)
d = new_d
pass_nr += 1
return d, s
#staticmethod
def change_to_block(d, too_long, changed, only_first):
if isinstance(d, dict):
if d.fa.flow_style() and d.lc.line in too_long:
d.fa.set_block_style()
changed[0] = True
return # don't convert nested flow styles, might not be necessary
# don't change keys if any value is changed
for v in d.values():
Blockify.change_to_block(v, too_long, changed, only_first)
if only_first and changed[0]:
return
if changed[0]: # don't change keys if value has changed
return
for k in d:
Blockify.change_to_block(k, too_long, changed, only_first)
if only_first and changed[0]:
return
if isinstance(d, (list, tuple)):
if d.fa.flow_style() and d.lc.line in too_long:
d.fa.set_block_style()
changed[0] = True
return # don't convert nested flow styles, might not be necessary
for elem in d:
Blockify.change_to_block(elem, too_long, changed, only_first)
if only_first and changed[0]:
return
blockify = Blockify(96, verbose=2) # set verbose to 0, to suppress progress output
yaml = ruamel.yaml.YAML(typ=['rt', 'string'])
data = yaml.load(yaml_str)
blockified_data, string_output = blockify(data)
print('-'*32, 'result:', '-'*32)
print(string_output) # string_output has no final newline
which gives:
movie: bladerunner
quote: {[Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]}
pass: 0, lines: [1], longest: 186
movie: bladerunner
quote:
[Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]
pass: 1, lines: [2], longest: 179
movie: bladerunner
quote:
[Batty, Roy]:
- I have seen things you people wouldn't believe.
- Attack ships on fire off the shoulder of Orion.
- I watched C-beams glitter in the dark near the Tannhäuser Gate.
pass: 2, lines: [], longest: 67
-------------------------------- result: --------------------------------
movie: bladerunner
quote:
[Batty, Roy]:
- I have seen things you people wouldn't believe.
- Attack ships on fire off the shoulder of Orion.
- I watched C-beams glitter in the dark near the Tannhäuser Gate.
Please note that when using ruamel.yaml<0.18 the sequence [Batty, Roy] never will be in block style
because the tuple subclass CommentedKeySeq does never get a line number attached.

Related

How can I use the ruamel.yaml rtsc mode?

I've been working on creating a YAML re-formatter based on ruamel.yaml (which you can see here).
I'm currently using version 0.17.20.
Cleaning up comments and whitespace has been difficult. I want to:
ensure there is only one space before the # for EOL comments
align full line comments with the key or item immediately following
remove duplicate blank lines so there is at most one blank line
To get closer to achieving that, I have a custom Emitter class where I extend write_comment to adjust the comments just before writing with super().write_comment(...). However, the Emitter does not know about which key or item comes next because comments are generally attached as post comments.
As I've studied the ruamel.yaml code to figure out how to do this, I found the rtsc mode (Round Trip Split Comments) which looks fantastic because it separates EOLComment, BlankLineComment and FullLineComment instead of lumping them together.
From what I can tell, the Parser and Scanner have been adjusted to capture the comments. So, loading is (mostly?) implemented with this "NEWCMNT" implementation. But Emitter.write_comment expects CommentToken instead of comment line numbers, so dumping does not work yet.
If I update my Emitter.write_comment method, is that enough to finish dumping? Or what else might be necessary? In one of my tries, I ran into a sys.exit in ScannedComments.assign_eol() - what else is needed to finish that?
PS: I wouldn't normally ask how to collaborate on StackOverflow, but this is not a bug report or a feature request, and I'm trying/failing to use a new (undocumented) feature, so I'm filing this here instead of sourceforge.
rtsc is work in progress cq work started but unfinished. It's internals will almost certainly change.
Two of the three points you indicate can relatively easy be implemented:
set the column of each comment to 0 ( by recursively going over a loaded data structure similar to here ) if the column is before the position of the end of the value on a line, you'll get one space between the value and the column
at the same time doing the recursion in the previous point. Take each comment value and do something like:
value = '\n'.join(line.strip() for line in value.splitlines()
while '\n\n\n' in value:
value = value.replace('\n\n\n', '\n\n')
The indentation to the following element is difficult, depends on the
data structure etc. Given that these are full line comments, I suggest
you do some postprocessing of the YAML document you generate:
find a full line comment, gather full line comments until next line is
not full line comment (i.e. some "real" YAML). Since full line comments
are in column[0] if the previous stuff is applied, you don't have to
track if you are in a (multi-line) literal or folded scalar string where
one of the lines happens to start with #
determine number of spaces
before real YAML and apply these to the full line comments.
import sys
import ruamel.yaml
yaml_str = """\
# the following is a example YAML doc
a:
- b: 42
# collapse multiple empty lines
c: |
# this is not a comment
it is the first line of a block style literal scalar
processing this gobbles a newline which doesn't go into a comment
# that is unless you have a (dedented) comment directly following
d: 42 # and some non-full line comment
e: # another one
# and some more comments to align
f: glitter in the dark near the Tannhäuser gate
"""
def redo_comments(d):
def do_one(comment):
if not comment:
return
comment.column = 0
value = '\n'.join(line.strip() for line in comment.value.splitlines()) + '\n'
while '\n\n\n' in value:
value = value.replace('\n\n\n', '\n\n')
comment.value = value
def do_values(v):
for x in v:
for comment in x:
do_one(comment)
def do_loc(v):
if v is None:
return
do_one(v[0])
if not v[1]:
return
for comment in v[1]:
do_one(comment)
if isinstance(d, dict):
do_loc(d.ca.comment)
do_values(d.ca.items.values())
for val in d.values():
redo_comments(val)
elif isinstance(d, list):
do_values(d.ca.items.values())
for elem in d:
redo_comments(elem)
def realign_full_line_comments(s):
res = []
buf = []
for line in s.splitlines(True):
if not buf:
if line and line[0] == '#':
buf.append(line)
else:
res.append(line)
else:
if line[0] in '#\n':
buf.append(line)
else:
# YAML line, determine indent
count = 0
while line[count] == ' ':
count += 1
if count > len(line):
break # superfluous?
indent = ' ' * count
for cline in buf:
if cline[0] == '\n': # empty
res.append(cline)
else:
res.append(indent + cline)
buf = []
res.append(line)
return ''.join(res)
yaml = ruamel.yaml.YAML()
# yaml.indent(mapping=4, sequence=4, offset=2)
# yaml.preserve_quotes = True
data = yaml.load(yaml_str)
redo_comments(data)
yaml.dump(data, sys.stdout, transform=realign_full_line_comments)
which gives:
# the following is a example YAML doc
a:
- b: 42
# collapse multiple empty lines
c: |
# this is not a comment
it is the first line of a block style literal scalar
processing this gobbles a newline which doesn't go into a comment
# that is unless you have a (dedented) comment directly following
d: 42 # and some non-full line comment
e: # another one
# and some more comments to align
f: glitter in the dark near the Tannhäuser gate

Trying to put together a teaching-example with pyhf

I'm trying to learn more about pyhf and my understanding of what the goals are might be limited. I would love to fit my HEP data outside of ROOT, but I could be imposing expectations on pyhf which are not what the authors intended for it's use.
I'd like to write myself a hello-world example, but I might just not know what I'm doing. My misunderstanding could also be gaps in my statistical knowledge.
With that preface, let me explain what I'm trying to explore.
I have some observed set of events for which I calculate some observable and make a binned histogram of that data. I hypothesize that there are two contributing physics processes, which I call signal and background. I generate some Monte Carlo samples for these processes and the theorized total number of events is close to, but not exactly what I observe.
I would like to:
Fit the data to this two process hypothesis
Get from the fit the optimal values for the number of events for each process
Get the uncertainties on these fitted values
If appropriate, calculate an upper limit on the number of signal events.
My starter code is below, where all I'm doing is an ML fit but I'm not sure where to go. I know it's not set up to do what I want, but I'm getting lost in the examples I find on RTD. I'm sure it's me, this is not a criticism of the documentation.
import pyhf
import numpy as np
import matplotlib.pyplot as plt
nbins = 15
# Generate a background and signal MC sample`
MC_signal_events = np.random.normal(5,1.0,200)
MC_background_events = 10*np.random.random(1000)
signal_data = np.histogram(MC_signal_events,bins=nbins)[0]
bkg_data = np.histogram(MC_background_events,bins=nbins)[0]
# Generate an observed dataset with a slightly different
# number of events
signal_events = np.random.normal(5,1.0,180)
background_events = 10*np.random.random(1050)
observed_events = np.array(signal_events.tolist() + background_events.tolist())
observed_sample = np.histogram(observed_events,bins=nbins)[0]
# Plot these samples, if you like
plt.figure(figsize=(12,4))
plt.subplot(1,3,1)
plt.hist(observed_events,bins=nbins,label='Observations')
plt.legend()
plt.subplot(1,3,2)
plt.hist(MC_signal_events,bins=nbins,label='MC signal')
plt.legend()
plt.subplot(1,3,3)
plt.hist(MC_background_events,bins=nbins,label='MC background')
plt.legend()
# Use a very naive estimate of the background
# uncertainties
bkg_uncerts = np.sqrt(bkg_data)
print("Defining the PDF.......")
pdf = pyhf.simplemodels.hepdata_like(signal_data=signal_data.tolist(), \
bkg_data=bkg_data.tolist(), \
bkg_uncerts=bkg_uncerts.tolist())
print("Fit.......")
data = pyhf.tensorlib.astensor(observed_sample.tolist() + pdf.config.auxdata)
bestfit_pars, twice_nll = pyhf.infer.mle.fit(data, pdf, return_fitted_val=True)
print(bestfit_pars)
print(twice_nll)
plt.show()
Note: this answer is based on pyhf v0.5.2.
Alright, so it looks like you've managed to figure most of the big pieces for sure. However, there's two different ways to do this depending on how you prefer to set things up. In both cases, I assume you want an unconstrained fit and you want to...
fit your signal+background model to observed data
fit your background model to observed data
First, let's discuss uncertainties briefly. At the moment, we default to numpy for the tensor background and scipy for the optimizer. See documentation:
numpy backend
scipy optimizer
However, one unfortunate drawback right now with the scipy optimizer is that it cannot return the uncertainties. What you need to do anywhere in your code before the fit (although we generally recommend as early as possible) is to use the minuit optimizer instead:
pyhf.set_backend('numpy', 'minuit')
This will get you the nice features of being able to get the correlation matrix, the uncertainties on the fitted parameters, and the hessian -- amongst other things. We're working to make this consistent for scipy as well, but this is not ready right now.
All optimizations go through our optimizer API which you can currently view through the mixin here in our documentation. Specifically, the signature is
minimize(
objective,
data,
pdf,
init_pars,
par_bounds,
fixed_vals=None,
return_fitted_val=False,
return_result_obj=False,
do_grad=None,
do_stitch=False,
**kwargs)
There are a lot of options here. Let's just focus on the fact that one of the keyword arguments we can pass through is return_uncertainties which will change the bestfit parameters by adding a column for the fitted parameter uncertainty which you want.
1. Signal+Background
In this case, we want to just use the default model
result, twice_nll = pyhf.infer.mle.fit(
data,
pdf,
return_uncertainties=True,
return_fitted_val=True
)
bestfit_pars, errors = result.T
2. Background-Only
In this case, we need to turn off the signal. The way we do this is by setting the parameter of interest (POI) fixed to 0.0. Then we can get the fitted parameters for the background-only model in a similar way, but using fixed_poi_fit instead of an unconstrained fit:
result, twice_nll = pyhf.infer.mle.fixed_poi_fit(
0.0,
data,
pdf,
return_uncertainties=True,
return_fitted_val=True
)
bestfit_pars, errors = result.T
Note that this is quite simply a quick way of doing the following unconstrained fit
bkg_params = pdf.config.suggested_init()
fixed_params = pdf.config.suggested_fixed()
bkg_params[pdf.config.poi_index] = 0.0
fixed_params[pdf.config.poi_index] = True
result, twice_nll = pyhf.infer.mle.fit(
data,
pdf,
init_pars=bkg_params,
fixed_params=fixed_params,
return_uncertainties=True,
return_fitted_val=True
)
bestfit_pars, errors = result.T
Hopefully that clarifies things up more!
Giordon's solution should answer all of your question, but I thought I'd also write out the code to basically address everything we can.
I also take the liberty of changing some of your values a bit so that the signal isn't so strong that the observed CLs value isn't far off to the right of the Brazil band (the results aren't wrong obviously, but it probably makes more sense to be talking about using the discovery test statistic at that point then setting limits. :))
Environment
For this example I'm going to setup a clean Python 3 virtual environment and then install the dependencies (here we're going to be using pyhf v0.5.2)
$ python3 -m venv "${HOME}/.venvs/question"
$ . "${HOME}/.venvs/question/bin/activate"
(question) $ cat requirements.txt
pyhf[minuit,contrib]~=0.5.2
black
(question) $ python -m pip install -r requirements.txt
Code
While we can't easily get the best fit value for both the number of signal events as well as the background events we definitely can do inference to get the best fit value for the signal strength.
The following chunk of code (which is long only because of the visualization) should address all of the points of your question.
# answer.py
import numpy as np
import pyhf
import matplotlib.pyplot as plt
import pyhf.contrib.viz.brazil
# Goals:
# - Fit the model to the observed data
# - Infer the best fit signal strength given the model
# - Get the uncertainties on the best fit signal strength
# - Calculate an 95% CL upper limit on the signal strength
def plot_hist(ax, bins, data, bottom=0, color=None, label=None):
bin_width = bins[1] - bins[0]
bin_leftedges = bins[:-1]
bin_centers = [edge + bin_width / 2.0 for edge in bin_leftedges]
ax.bar(
bin_centers, data, bin_width, bottom=bottom, alpha=0.5, color=color, label=label
)
def plot_data(ax, bins, data, label="Data"):
bin_width = bins[1] - bins[0]
bin_leftedges = bins[:-1]
bin_centers = [edge + bin_width / 2.0 for edge in bin_leftedges]
ax.scatter(bin_centers, data, color="black", label=label)
def invert_interval(test_mus, hypo_tests, test_size=0.05):
# This will be taken care of in v0.5.3
cls_obs = np.array([test[0] for test in hypo_tests]).flatten()
cls_exp = [
np.array([test[1][idx] for test in hypo_tests]).flatten() for idx in range(5)
]
crossing_test_stats = {"exp": [], "obs": None}
for cls_exp_sigma in cls_exp:
crossing_test_stats["exp"].append(
np.interp(
test_size, list(reversed(cls_exp_sigma)), list(reversed(test_mus))
)
)
crossing_test_stats["obs"] = np.interp(
test_size, list(reversed(cls_obs)), list(reversed(test_mus))
)
return crossing_test_stats
def main():
np.random.seed(0)
pyhf.set_backend("numpy", "minuit")
observable_range = [0.0, 10.0]
bin_width = 0.5
_bins = np.arange(observable_range[0], observable_range[1] + bin_width, bin_width)
n_bkg = 2000
n_signal = int(np.sqrt(n_bkg))
# Generate simulation
bkg_simulation = 10 * np.random.random(n_bkg)
signal_simulation = np.random.normal(5, 1.0, n_signal)
bkg_sample, _ = np.histogram(bkg_simulation, bins=_bins)
signal_sample, _ = np.histogram(signal_simulation, bins=_bins)
# Generate observations
signal_events = np.random.normal(5, 1.0, int(n_signal * 0.8))
bkg_events = 10 * np.random.random(int(n_bkg + np.sqrt(n_bkg)))
observed_events = np.array(signal_events.tolist() + bkg_events.tolist())
observed_sample, _ = np.histogram(observed_events, bins=_bins)
# Visualize the simulation and observations
fig, ax = plt.subplots()
fig.set_size_inches(7, 5)
plot_hist(ax, _bins, bkg_sample, label="Background")
plot_hist(ax, _bins, signal_sample, bottom=bkg_sample, label="Signal")
plot_data(ax, _bins, observed_sample)
ax.legend(loc="best")
ax.set_ylim(top=np.max(observed_sample) * 1.4)
ax.set_xlabel("Observable")
ax.set_ylabel("Count")
fig.savefig("components.png")
# Build the model
bkg_uncerts = np.sqrt(bkg_sample)
model = pyhf.simplemodels.hepdata_like(
signal_data=signal_sample.tolist(),
bkg_data=bkg_sample.tolist(),
bkg_uncerts=bkg_uncerts.tolist(),
)
data = pyhf.tensorlib.astensor(observed_sample.tolist() + model.config.auxdata)
# Perform inference
fit_result = pyhf.infer.mle.fit(data, model, return_uncertainties=True)
bestfit_pars, par_uncerts = fit_result.T
print(
f"best fit parameters:\
\n * signal strength: {bestfit_pars[0]} +/- {par_uncerts[0]}\
\n * nuisance parameters: {bestfit_pars[1:]}\
\n * nuisance parameter uncertainties: {par_uncerts[1:]}"
)
# Perform hypothesis test scan
_start = 0.0
_stop = 5
_step = 0.1
poi_tests = np.arange(_start, _stop + _step, _step)
print("\nPerforming hypothesis tests\n")
hypo_tests = [
pyhf.infer.hypotest(
mu_test,
data,
model,
return_expected_set=True,
return_test_statistics=True,
qtilde=True,
)
for mu_test in poi_tests
]
# Upper limits on signal strength
results = invert_interval(poi_tests, hypo_tests)
print(f"Observed Limit on µ: {results['obs']:.2f}")
print("-----")
for idx, n_sigma in enumerate(np.arange(-2, 3)):
print(
"Expected {}Limit on µ: {:.3f}".format(
" " if n_sigma == 0 else "({} σ) ".format(n_sigma),
results["exp"][idx],
)
)
# Visualize the "Brazil band"
fig, ax = plt.subplots()
fig.set_size_inches(7, 5)
ax.set_title("Hypothesis Tests")
ax.set_ylabel(r"$\mathrm{CL}_{s}$")
ax.set_xlabel(r"$\mu$")
pyhf.contrib.viz.brazil.plot_results(ax, poi_tests, hypo_tests)
fig.savefig("brazil_band.png")
if __name__ == "__main__":
main()
which when run gives
(question) $ python answer.py
best fit parameters:
* signal strength: 1.5884737977889158 +/- 0.7803435235862329
* nuisance parameters: [0.99020988 1.06040191 0.90488207 1.03531383 1.09093327 1.00942088
1.07789316 1.01125627 1.06202964 0.95780043 0.94990993 1.04893286
1.0560711 0.9758487 0.93692481 1.04683181 1.05785515 0.92381263
0.93812855 0.96751869]
* nuisance parameter uncertainties: [0.06966439 0.07632218 0.0611428 0.07230328 0.07872258 0.06899675
0.07472849 0.07403246 0.07613661 0.08606657 0.08002775 0.08655314
0.07564512 0.07308117 0.06743479 0.07383134 0.07460864 0.06632003
0.06683251 0.06270965]
Performing hypothesis tests
/home/stackoverflow/.venvs/question/lib/python3.7/site-packages/pyhf/infer/calculators.py:229: RuntimeWarning: invalid value encountered in double_scalars
teststat = (qmu - qmu_A) / (2 * self.sqrtqmuA_v)
Observed Limit on µ: 2.89
-----
Expected (-2 σ) Limit on µ: 0.829
Expected (-1 σ) Limit on µ: 1.110
Expected Limit on µ: 1.542
Expected (1 σ) Limit on µ: 2.147
Expected (2 σ) Limit on µ: 2.882
Let us know if you have any further questions!

How to calculate shap values for ADABoost model?

I am running 3 different model (Random forest, Gradient Boosting, Ada Boost) and a model ensemble based on these 3 models.
I managed to use SHAP for GB and RF but not for ADA with the following error:
Exception Traceback (most recent call last)
in engine
----> 1 explainer = shap.TreeExplainer(model,data = explain_data.head(1000), model_output= 'probability')
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_perturbation, **deprecated_options)
110 self.feature_perturbation = feature_perturbation
111 self.expected_value = None
--> 112 self.model = TreeEnsemble(model, self.data, self.data_missing)
113
114 if feature_perturbation not in feature_perturbation_codes:
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing)
752 self.tree_output = "probability"
753 else:
--> 754 raise Exception("Model type not yet supported by TreeExplainer: " + str(type(model)))
755
756 # build a dense numpy version of all the tree objects
Exception: Model type not yet supported by TreeExplainer: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
I found this link on Git that state
TreeExplainer creates a TreeEnsemble object from whatever model type we are trying to explain, and then works with that downstream. So all you would need to do is and add another if statement in the
TreeEnsemble constructor similar to the one for gradient boosting
But I really don't know how to implement it since I quite new to this.
I had the same problem and what I did, was to modify the file in the git you are commenting.
In my case I use windows so the file is in C:\Users\my_user\AppData\Local\Continuum\anaconda3\Lib\site-packages\shap\explainers but you can do double click over the error message and the file will be opened.
The next step is to add another elif as the answer of the git help says. In my case I did it from the line 404 as following:
1) Modify the source code.
...
self.objective = objective_name_map.get(model.criterion, None)
self.tree_output = "probability"
elif str(type(model)).endswith("sklearn.ensemble.weight_boosting.AdaBoostClassifier'>"): #From this line I have modified the code
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
self.objective = objective_name_map.get(model.base_estimator_.criterion, None) #This line is done to get the decision criteria, for example gini.
self.tree_output = "probability" #This is the last line I added
elif str(type(model)).endswith("sklearn.ensemble.forest.ExtraTreesClassifier'>"): # TODO: add unit test for this case
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
...
Note in the other models, the code of shap needs the attribute 'criterion' that the AdaBoost classifier doesn't have in a direct way. So in this case this attribute is obtained from the "weak" classifiers with the AdaBoost has been trained, that's why I add model.base_estimator_.criterion .
Finally you have to import the library again, train your model and get the shap values. I leave an example:
2) Import again the library and try:
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
import shap
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
ADABoost_model = AdaBoostClassifier()
ADABoost_model.fit(X, y)
shap_values = shap.TreeExplainer(ADABoost_model).shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")
Which generates the following:
3) Get your new results:
It seems that the shap package has been updated and still does not contain the AdaBoostClassifier. Based on the previous answer, I've modified the previous answer to work with the shap/explainers/tree.py file in lines 598-610
### Added AdaBoostClassifier based on the outdated StackOverflow response and Github issue here
### https://stackoverflow.com/questions/60433389/how-to-calculate-shap-values-for-adaboost-model/61108156#61108156
### https://github.com/slundberg/shap/issues/335
elif safe_isinstance(model, ["sklearn.ensemble.AdaBoostClassifier", "sklearn.ensemble._weighted_boosting.AdaBoostClassifier"]):
assert hasattr(model, "estimators_"), "Model has no `estimators_`! Have you called `model.fit`?"
self.internal_dtype = model.estimators_[0].tree_.value.dtype.type
self.input_dtype = np.float32
scaling = 1.0 / len(model.estimators_) # output is average of trees
self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
self.objective = objective_name_map.get(model.base_estimator_.criterion, None) #This line is done to get the decision criteria, for example gini.
self.tree_output = "probability" #This is the last line added
Also working on testing to add this to the package :)

Error in setting max features parameter in Isolation Forest algorithm using sklearn

I'm trying to train a dataset with 357 features using Isolation Forest sklearn implementation. I can successfully train and get results when the max features variable is set to 1.0 (the default value).
However when max features is set to 2, it gives the following error:
ValueError: Number of features of the model must match the input.
Model n_features is 2 and input n_features is 357
It also gives the same error when the feature count is 1 (int) and not 1.0 (float).
How I understood was that when the feature count is 2 (int), two features should be considered in creating each tree. Is this wrong? How can I change the max features parameter?
The code is as follows:
from sklearn.ensemble.iforest import IsolationForest
def isolation_forest_imp(dataset):
estimators = 10
samples = 100
features = 2
contamination = 0.1
bootstrap = False
random_state = None
verbosity = 0
estimator = IsolationForest(n_estimators=estimators, max_samples=samples, contamination=contamination,
max_features=features,
bootstrap=boostrap, random_state=random_state, verbose=verbosity)
model = estimator.fit(dataset)
In the documentation it states:
max_features : int or float, optional (default=1.0)
The number of features to draw from X to train each base estimator.
- If int, then draw `max_features` features.
- If float, then draw `max_features * X.shape[1]` features.
So, 2 should mean take two features and 1.0 should mean take all of the features, 0.5 take half and so on, from what I understand.
I think this could be a bug, since, taking a look in IsolationForest's fit:
# Isolation Forest inherits from BaseBagging
# and when _fit is called, BaseBagging takes care of the features correctly
super(IsolationForest, self)._fit(X, y, max_samples,
max_depth=max_depth,
sample_weight=sample_weight)
# however, when after _fit the decision_function is called using X - the whole sample - not taking into account the max_features
self.threshold_ = -sp.stats.scoreatpercentile(
-self.decision_function(X), 100. * (1. - self.contamination))
then:
# when the decision function _validate_X_predict is called, with X unmodified,
# it calls the base estimator's (dt) _validate_X_predict with the whole X
X = self.estimators_[0]._validate_X_predict(X, check_input=True)
...
# from tree.py:
def _validate_X_predict(self, X, check_input):
"""Validate X whenever one tries to predict, apply, predict_proba"""
if self.tree_ is None:
raise NotFittedError("Estimator not fitted, "
"call `fit` before exploiting the model.")
if check_input:
X = check_array(X, dtype=DTYPE, accept_sparse="csr")
if issparse(X) and (X.indices.dtype != np.intc or
X.indptr.dtype != np.intc):
raise ValueError("No support for np.int64 index based "
"sparse matrices")
# so, this check fails because X is the original X, not with the max_features applied
n_features = X.shape[1]
if self.n_features_ != n_features:
raise ValueError("Number of features of the model must "
"match the input. Model n_features is %s and "
"input n_features is %s "
% (self.n_features_, n_features))
return X
So, I am not sure on how you can handle this. Maybe figure out the percentage that leads to just the two features you need - even though I am not sure it'll work as expected.
Note: I am using scikit-learn v.0.18
Edit: as #Vivek Kumar commented this is an issue and upgrading to 0.20 should do the trick.

YAML deserializer with position information?

Does anyone know of a YAML deserializer that can provide position information for the constructed objects?
I know how to deserialize a YAML file into a Java object. Simple instructions on http://yamlbeans.sourceforge.net/.
However, I want to do some algorithmic validation on the deserialized object and report error back to the user pointing to the position in the YAML that cause the error.
Example:
=========YAML file==========
name: Nathan Sweet
age: 28
address: 4011 16th Ave S
=======JAVA class======
public class Contact {
public String name;
public int age;
public String address;
}
Imagine if I want to first load the yaml into Contact class and then validate the address against some repository and error back if its invalid. Something like:
'Line 3 Column 9: The address does not match valid entry in the database'
The problem is, currently there is no way to get the position inside a deserialized object from YAML.
Anyone know a solution to this issue?
Most YAML parsers, if they keep any information about positions around they drop it while constructing the language native objects.
In ruamel.yaml ¹, I keep more information around because I want to be able to round-trip with minimal loss of original layout (e.g. keeping comments and key order in mappings).
I don't keep information on individual key-value pairs, but I do on the "upper-left" position of a mapping². Because of the kept order of the mapping items you can give some rather nice feedback. Given an input file:
- name: anthon
age: 53
adres: Rijn en Schiekade 105
- name: Nathan Sweet
age: 28
address: 4011 16th Ave S
And a program that you call with the input file as argument:
#! /usr/bin/env python2.7
# coding: utf-8
# http://stackoverflow.com/questions/30677517/yaml-deserializer-with-position-information?noredirect=1#comment49491314_30677517
import sys
import ruamel.yaml
up_arrow = '↑'
def key_error(key, value, line, col, error, e='E'):
print('E[{}:{}]: {}'.format(line, col, error))
print('{}{}: {}'.format(' '*col, key, value))
print('{}{}'.format(' '*(col), up_arrow))
print('---')
def value_error(key, value, line, col, error, e='E'):
val_col = col + len(key) + 2
print('{}[{}:{}]: {}'.format(e, line, val_col, error))
print('{}{}: {}'.format(' '*col, key, value))
print('{}{}'.format(' '*(val_col), up_arrow))
print('---')
def value_warning(key, value, line, col, error):
value_error(key, value, line, col, error, e='W')
class Contact(object):
def __init__(self, vals):
for offset, k in enumerate(vals):
self.check(k, vals[k], vals.lc.line+offset, vals.lc.col)
for k in ['name', 'address', 'age']:
if k not in vals:
print('K[{}:{}]: {}'.format(
vals.lc.line+offset, vals.lc.col, "missing key: "+k
))
print('---')
def check(self, key, value, line, col):
if key == 'name':
if value[0].lower() == value[0]:
value_error(key, value, line, col,
'value should start with uppercase')
elif key == 'age':
if value < 50:
value_warning(key, value, line, col,
'probably too young for knowing ALGOL 60')
elif key == 'address':
pass
else:
key_error(key, value, line, col,
"unexpected key")
data = ruamel.yaml.load(open(sys.argv[1]), Loader=ruamel.yaml.RoundTripLoader)
for x in data:
contact = Contact(x)
giving you E(rrors), W(arnings) and K(eys missing):
E[0:8]: value should start with uppercase
name: anthon
↑
---
E[2:2]: unexpected key
adres: Rijn en Schiekade 105
↑
---
K[2:2]: missing key: address
---
W[4:7]: probably too young for knowing ALGOL 60
age: 28
↑
---
Which you should be able to parser in a calling program in any language to give feedback. The check method of course need adjusting to your requirements. This is not as nice as being to do that in the language the rest of your application is in, but it might be better than nothing.
In my experience handling the above format is certainly simpler than extending an existing (open source) YAML parser.
¹ Disclaimer: I am the author of that package
² I want to use that kind of information at some point to preserve spurious newlines, inserted for readability
In python, you can readily write custom Dumper/Loader objects and use them to load (or dump) your yaml code. You can have these objects track the file/line info:
import yaml
from collections import OrderedDict
class YamlOrderedDict(OrderedDict):
"""
An OrderedDict that was loaded from a yaml file, and is annotated
with file/line info for reporting about errors in the source file
"""
def _annotate(self, node):
self._key_locs = {}
self._value_locs = {}
nodeiter = node.value.__iter__()
for key in self:
subnode = nodeiter.next()
self._key_locs[key] = subnode[0].start_mark.name + ':' + \
str(subnode[0].start_mark.line+1)
self._value_locs[key] = subnode[1].start_mark.name + ':' + \
str(subnode[1].start_mark.line+1)
def key_loc(self, key):
try:
return self._key_locs[key]
except AttributeError, KeyError:
return ''
def value_loc(self, key):
try:
return self._value_locs[key]
except AttributeError, KeyError:
return ''
# Use YamlOrderedDict objects for yaml maps instead of normal dict
yaml.add_representer(OrderedDict, lambda dumper, data:
dumper.represent_dict(data.iteritems()))
yaml.add_representer(YamlOrderedDict, lambda dumper, data:
dumper.represent_dict(data.iteritems()))
def _load_YamlOrderedDict(loader, node):
rv = YamlOrderedDict(loader.construct_pairs(node))
rv._annotate(node)
return rv
yaml.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, _load_YamlOrderedDict)
Now when you read a yaml file, any mapping objects will be read as a YamlOrderedDict, which allows looking up the file location of keys in the mapping object. You can also add an iterator method like:
def iter_with_lines(self):
for key, val in self.items():
yield (key, val, self.key_loc(key))
...and now you can write a loop like:
for key,value,location in obj.iter_with_lines():
# iterate through the key/value pairs in a YamlOrderedDict, with
# the source file location

Resources