How to suppress the header information when using summary(mcmc) in the coda package

I have written a function that extracts already-thinned, post-warmup MCMC samples after some processing; its output is an mcmc object. However, when I run summary() on it, the function has no way of knowing that the output is already processed, so it displays incorrect information in the header (Iterations, Thinning interval, and so on).
My question is: is there a way to suppress the header information when I call summary() on an mcmc object?
> summary(output)
Iterations = 1:10000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:
           Mean        SD  Naive SE Time-series SE
beta  1.052e+00 3.189e-02 8.914e-04      9.185e-04
df    3.849e+00 2.916e-01 8.150e-03      1.516e-02
sigma 1.056e-02 2.504e-04 6.998e-06      1.000e-05
tau   8.990e+03 4.273e+02 1.194e+01      1.710e+01
2. Quantiles for each variable:
           2.5%       25%       50%       75%     97.5%
beta  9.891e-01 1.032e+00 1.052e+00 1.073e+00 1.113e+00
df    3.304e+00 3.650e+00 3.836e+00 4.042e+00 4.450e+00
sigma 1.004e-02 1.039e-02 1.055e-02 1.072e-02 1.105e-02
tau   8.197e+03 8.700e+03 8.977e+03 9.263e+03 9.917e+03


Trying to put together a teaching-example with pyhf

I'm trying to learn more about pyhf, and my understanding of what its goals are might be limited. I would love to fit my HEP data outside of ROOT, but I could be imposing expectations on pyhf that are not what the authors intended for its use.
I'd like to write myself a hello-world example, but I might just not know what I'm doing. My misunderstanding could also be gaps in my statistical knowledge.
With that preface, let me explain what I'm trying to explore.
I have some observed set of events for which I calculate some observable and make a binned histogram of that data. I hypothesize that there are two contributing physics processes, which I call signal and background. I generate some Monte Carlo samples for these processes and the theorized total number of events is close to, but not exactly what I observe.
I would like to:
Fit the data to this two process hypothesis
Get from the fit the optimal values for the number of events for each process
Get the uncertainties on these fitted values
If appropriate, calculate an upper limit on the number of signal events.
My starter code is below, where all I'm doing is an ML fit but I'm not sure where to go. I know it's not set up to do what I want, but I'm getting lost in the examples I find on RTD. I'm sure it's me, this is not a criticism of the documentation.
import pyhf
import numpy as np
import matplotlib.pyplot as plt
nbins = 15
# Generate a background and signal MC sample
MC_signal_events = np.random.normal(5,1.0,200)
MC_background_events = 10*np.random.random(1000)
signal_data = np.histogram(MC_signal_events,bins=nbins)[0]
bkg_data = np.histogram(MC_background_events,bins=nbins)[0]
# Generate an observed dataset with a slightly different
# number of events
signal_events = np.random.normal(5,1.0,180)
background_events = 10*np.random.random(1050)
observed_events = np.array(signal_events.tolist() + background_events.tolist())
observed_sample = np.histogram(observed_events,bins=nbins)[0]
# Plot these samples, if you like
plt.figure(figsize=(12,4))
plt.subplot(1,3,1)
plt.hist(observed_events,bins=nbins,label='Observations')
plt.legend()
plt.subplot(1,3,2)
plt.hist(MC_signal_events,bins=nbins,label='MC signal')
plt.legend()
plt.subplot(1,3,3)
plt.hist(MC_background_events,bins=nbins,label='MC background')
plt.legend()
# Use a very naive estimate of the background
# uncertainties
bkg_uncerts = np.sqrt(bkg_data)
print("Defining the PDF.......")
pdf = pyhf.simplemodels.hepdata_like(signal_data=signal_data.tolist(), \
bkg_data=bkg_data.tolist(), \
bkg_uncerts=bkg_uncerts.tolist())
print("Fit.......")
data = pyhf.tensorlib.astensor(observed_sample.tolist() + pdf.config.auxdata)
bestfit_pars, twice_nll = pyhf.infer.mle.fit(data, pdf, return_fitted_val=True)
print(bestfit_pars)
print(twice_nll)
plt.show()
Note: this answer is based on pyhf v0.5.2.
Alright, so it looks like you've managed to figure out most of the big pieces. However, there are two different ways to do this depending on how you prefer to set things up. In both cases, I assume you want an unconstrained fit and you want to...
fit your signal+background model to observed data
fit your background model to observed data
First, let's discuss uncertainties briefly. At the moment, we default to numpy for the tensor backend and scipy for the optimizer. See documentation:
numpy backend
scipy optimizer
However, one unfortunate drawback right now with the scipy optimizer is that it cannot return the uncertainties. What you need to do anywhere in your code before the fit (although we generally recommend as early as possible) is to use the minuit optimizer instead:
pyhf.set_backend('numpy', 'minuit')
This will get you the nice features of being able to get the correlation matrix, the uncertainties on the fitted parameters, and the hessian -- amongst other things. We're working to make this consistent for scipy as well, but this is not ready right now.
All optimizations go through our optimizer API which you can currently view through the mixin here in our documentation. Specifically, the signature is
minimize(
objective,
data,
pdf,
init_pars,
par_bounds,
fixed_vals=None,
return_fitted_val=False,
return_result_obj=False,
do_grad=None,
do_stitch=False,
**kwargs)
There are a lot of options here. Let's just focus on the fact that one of the keyword arguments we can pass through is return_uncertainties which will change the bestfit parameters by adding a column for the fitted parameter uncertainty which you want.
1. Signal+Background
In this case, we want to just use the default model
result, twice_nll = pyhf.infer.mle.fit(
data,
pdf,
return_uncertainties=True,
return_fitted_val=True
)
bestfit_pars, errors = result.T
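To make the last line concrete: with return_uncertainties=True the result has one row per parameter and two columns (best-fit value, uncertainty), so transposing separates the columns. A minimal sketch with plain Python lists standing in for the tensor; the numbers are made up for illustration and are not pyhf output.

```python
# Hypothetical stand-in for a fit result: one [value, uncertainty] row per parameter.
result = [
    [1.20, 0.35],  # parameter of interest (signal strength)
    [0.98, 0.07],  # nuisance parameter 1
    [1.03, 0.06],  # nuisance parameter 2
]

# zip(*rows) transposes the rows, which is what `result.T` does for an array.
bestfit_pars, errors = (list(column) for column in zip(*result))

print(bestfit_pars)  # [1.2, 0.98, 1.03]
print(errors)        # [0.35, 0.07, 0.06]
```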
2. Background-Only
In this case, we need to turn off the signal. The way we do this is by setting the parameter of interest (POI) fixed to 0.0. Then we can get the fitted parameters for the background-only model in a similar way, but using fixed_poi_fit instead of an unconstrained fit:
result, twice_nll = pyhf.infer.mle.fixed_poi_fit(
0.0,
data,
pdf,
return_uncertainties=True,
return_fitted_val=True
)
bestfit_pars, errors = result.T
Note that this is quite simply a quick way of doing the following unconstrained fit
bkg_params = pdf.config.suggested_init()
fixed_params = pdf.config.suggested_fixed()
bkg_params[pdf.config.poi_index] = 0.0
fixed_params[pdf.config.poi_index] = True
result, twice_nll = pyhf.infer.mle.fit(
data,
pdf,
init_pars=bkg_params,
fixed_params=fixed_params,
return_uncertainties=True,
return_fitted_val=True
)
bestfit_pars, errors = result.T
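To see why fixing the POI to 0.0 gives a background-only model, it can help to write out the per-bin expectation. The sketch below assumes the simple model has the form mu * signal + gamma_i * background, with one nuisance parameter gamma_i per bin; the function name and the numbers are illustrative and not part of pyhf.

```python
def expected_counts(mu, gammas, signal, background):
    """Per-bin expectation: mu * s_i + gamma_i * b_i (assumed model form)."""
    return [mu * s + g * b for s, g, b in zip(signal, gammas, background)]


signal = [2.0, 10.0, 3.0]        # made-up signal template
background = [50.0, 52.0, 48.0]  # made-up background template
gammas = [1.0, 1.0, 1.0]         # nuisance parameters at their nominal value

print(expected_counts(1.0, gammas, signal, background))  # [52.0, 62.0, 51.0]
print(expected_counts(0.0, gammas, signal, background))  # [50.0, 52.0, 48.0]
```

With mu held at 0.0 only the gamma_i remain free, so the fit can adjust nothing but the background.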
Hopefully that clears things up!
Giordon's solution should answer all of your questions, but I thought I'd also write out the code to address everything we can.
I also take the liberty of changing some of your values a bit so that the signal isn't so strong that the observed CLs value is far off to the right of the Brazil band. (The results aren't wrong, obviously, but at that point it probably makes more sense to talk about using the discovery test statistic than about setting limits. :))
Environment
For this example I'm going to set up a clean Python 3 virtual environment and then install the dependencies (here we're going to be using pyhf v0.5.2)
$ python3 -m venv "${HOME}/.venvs/question"
$ . "${HOME}/.venvs/question/bin/activate"
(question) $ cat requirements.txt
pyhf[minuit,contrib]~=0.5.2
black
(question) $ python -m pip install -r requirements.txt
Code
While we can't easily get the best fit value for both the number of signal events as well as the background events we definitely can do inference to get the best fit value for the signal strength.
The following chunk of code (which is long only because of the visualization) should address all of the points of your question.
# answer.py
import numpy as np
import pyhf
import matplotlib.pyplot as plt
import pyhf.contrib.viz.brazil
# Goals:
# - Fit the model to the observed data
# - Infer the best fit signal strength given the model
# - Get the uncertainties on the best fit signal strength
# - Calculate a 95% CL upper limit on the signal strength
def plot_hist(ax, bins, data, bottom=0, color=None, label=None):
    bin_width = bins[1] - bins[0]
    bin_leftedges = bins[:-1]
    bin_centers = [edge + bin_width / 2.0 for edge in bin_leftedges]
    ax.bar(
        bin_centers, data, bin_width, bottom=bottom, alpha=0.5, color=color, label=label
    )
def plot_data(ax, bins, data, label="Data"):
    bin_width = bins[1] - bins[0]
    bin_leftedges = bins[:-1]
    bin_centers = [edge + bin_width / 2.0 for edge in bin_leftedges]
    ax.scatter(bin_centers, data, color="black", label=label)
def invert_interval(test_mus, hypo_tests, test_size=0.05):
    # This will be taken care of in v0.5.3
    cls_obs = np.array([test[0] for test in hypo_tests]).flatten()
    cls_exp = [
        np.array([test[1][idx] for test in hypo_tests]).flatten() for idx in range(5)
    ]
    crossing_test_stats = {"exp": [], "obs": None}
    for cls_exp_sigma in cls_exp:
        crossing_test_stats["exp"].append(
            np.interp(
                test_size, list(reversed(cls_exp_sigma)), list(reversed(test_mus))
            )
        )
    crossing_test_stats["obs"] = np.interp(
        test_size, list(reversed(cls_obs)), list(reversed(test_mus))
    )
    return crossing_test_stats
def main():
    np.random.seed(0)
    pyhf.set_backend("numpy", "minuit")
    observable_range = [0.0, 10.0]
    bin_width = 0.5
    _bins = np.arange(observable_range[0], observable_range[1] + bin_width, bin_width)
    n_bkg = 2000
    n_signal = int(np.sqrt(n_bkg))
    # Generate simulation
    bkg_simulation = 10 * np.random.random(n_bkg)
    signal_simulation = np.random.normal(5, 1.0, n_signal)
    bkg_sample, _ = np.histogram(bkg_simulation, bins=_bins)
    signal_sample, _ = np.histogram(signal_simulation, bins=_bins)
    # Generate observations
    signal_events = np.random.normal(5, 1.0, int(n_signal * 0.8))
    bkg_events = 10 * np.random.random(int(n_bkg + np.sqrt(n_bkg)))
    observed_events = np.array(signal_events.tolist() + bkg_events.tolist())
    observed_sample, _ = np.histogram(observed_events, bins=_bins)
    # Visualize the simulation and observations
    fig, ax = plt.subplots()
    fig.set_size_inches(7, 5)
    plot_hist(ax, _bins, bkg_sample, label="Background")
    plot_hist(ax, _bins, signal_sample, bottom=bkg_sample, label="Signal")
    plot_data(ax, _bins, observed_sample)
    ax.legend(loc="best")
    ax.set_ylim(top=np.max(observed_sample) * 1.4)
    ax.set_xlabel("Observable")
    ax.set_ylabel("Count")
    fig.savefig("components.png")
    # Build the model
    bkg_uncerts = np.sqrt(bkg_sample)
    model = pyhf.simplemodels.hepdata_like(
        signal_data=signal_sample.tolist(),
        bkg_data=bkg_sample.tolist(),
        bkg_uncerts=bkg_uncerts.tolist(),
    )
    data = pyhf.tensorlib.astensor(observed_sample.tolist() + model.config.auxdata)
    # Perform inference
    fit_result = pyhf.infer.mle.fit(data, model, return_uncertainties=True)
    bestfit_pars, par_uncerts = fit_result.T
    print(
        f"best fit parameters:\
            \n * signal strength: {bestfit_pars[0]} +/- {par_uncerts[0]}\
            \n * nuisance parameters: {bestfit_pars[1:]}\
            \n * nuisance parameter uncertainties: {par_uncerts[1:]}"
    )
    # Perform hypothesis test scan
    _start = 0.0
    _stop = 5
    _step = 0.1
    poi_tests = np.arange(_start, _stop + _step, _step)
    print("\nPerforming hypothesis tests\n")
    hypo_tests = [
        pyhf.infer.hypotest(
            mu_test,
            data,
            model,
            return_expected_set=True,
            return_test_statistics=True,
            qtilde=True,
        )
        for mu_test in poi_tests
    ]
    # Upper limits on signal strength
    results = invert_interval(poi_tests, hypo_tests)
    print(f"Observed Limit on µ: {results['obs']:.2f}")
    print("-----")
    for idx, n_sigma in enumerate(np.arange(-2, 3)):
        print(
            "Expected {}Limit on µ: {:.3f}".format(
                " " if n_sigma == 0 else "({} σ) ".format(n_sigma),
                results["exp"][idx],
            )
        )
    # Visualize the "Brazil band"
    fig, ax = plt.subplots()
    fig.set_size_inches(7, 5)
    ax.set_title("Hypothesis Tests")
    ax.set_ylabel(r"$\mathrm{CL}_{s}$")
    ax.set_xlabel(r"$\mu$")
    pyhf.contrib.viz.brazil.plot_results(ax, poi_tests, hypo_tests)
    fig.savefig("brazil_band.png")
if __name__ == "__main__":
    main()
which when run gives
(question) $ python answer.py
best fit parameters:
* signal strength: 1.5884737977889158 +/- 0.7803435235862329
* nuisance parameters: [0.99020988 1.06040191 0.90488207 1.03531383 1.09093327 1.00942088
1.07789316 1.01125627 1.06202964 0.95780043 0.94990993 1.04893286
1.0560711 0.9758487 0.93692481 1.04683181 1.05785515 0.92381263
0.93812855 0.96751869]
* nuisance parameter uncertainties: [0.06966439 0.07632218 0.0611428 0.07230328 0.07872258 0.06899675
0.07472849 0.07403246 0.07613661 0.08606657 0.08002775 0.08655314
0.07564512 0.07308117 0.06743479 0.07383134 0.07460864 0.06632003
0.06683251 0.06270965]
Performing hypothesis tests
/home/stackoverflow/.venvs/question/lib/python3.7/site-packages/pyhf/infer/calculators.py:229: RuntimeWarning: invalid value encountered in double_scalars
teststat = (qmu - qmu_A) / (2 * self.sqrtqmuA_v)
Observed Limit on µ: 2.89
-----
Expected (-2 σ) Limit on µ: 0.829
Expected (-1 σ) Limit on µ: 1.110
Expected Limit on µ: 1.542
Expected (1 σ) Limit on µ: 2.147
Expected (2 σ) Limit on µ: 2.882
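The limits above come from invert_interval, which finds the signal strength at which CLs crosses the test size of 0.05 by linear interpolation. Here is a standard-library sketch of that one step, assuming CLs decreases monotonically with µ; the function name and the scan values are made up for illustration.

```python
def crossing_point(test_mus, cls_values, test_size=0.05):
    """Linearly interpolate the mu at which a decreasing CLs curve hits test_size."""
    points = list(zip(test_mus, cls_values))
    for (mu_lo, cls_lo), (mu_hi, cls_hi) in zip(points, points[1:]):
        if cls_lo >= test_size >= cls_hi and cls_lo != cls_hi:
            fraction = (cls_lo - test_size) / (cls_lo - cls_hi)
            return mu_lo + fraction * (mu_hi - mu_lo)
    return None  # the scan never crosses test_size


# Made-up scan: CLs falls from 0.40 to 0.01 as mu grows.
mus = [0.0, 1.0, 2.0, 3.0]
cls = [0.40, 0.15, 0.03, 0.01]
limit = crossing_point(mus, cls)
print(f"{limit:.3f}")  # 1.833
```

If the scan range is too narrow for the curve to reach 0.05, the sketch returns None, which is a hint to widen the range of tested µ values.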
Let us know if you have any further questions!

How to reduce tensorflow dataset input pipeline host device (cpu) time (currently ~40%)?

I am trying to replicate the ResNet-18 paper. Before running this on the full ImageNet dataset on disk, I'm doing some evaluation runs with the publicly available imagenette/320px dataset from TFDS (a much smaller subset of ImageNet with 10 classes, already in .tfrecord format).
Note: the full notebook to do training and tracing is available here: resnet18_baseline.ipynb Just switch to a GPU runtime and run all the cells. It's already set-up with tensorboard profiling on the second batch. (You can use TPU as well, but some keras.layers.experimental.preprocessing layers do not support TPU ops yet and you have to enable soft device placement. Please use a GPU).
Input Operations
Read images from the input dataset. These images usually have different dimensions, and since tensors with different dimensions cannot be batched together, we need a crop function. Therefore, for training I use a random crop, and for the testing/validation datasets a center crop.
random_crop_layer = keras.layers.experimental.preprocessing.RandomCrop(224, 224)
center_crop_layer = keras.layers.experimental.preprocessing.CenterCrop(224, 224)
@tf.function(experimental_relax_shapes=True) # avoid retracing
def train_crop_fn(x, y):
    return random_crop_layer(x), y
@tf.function(experimental_relax_shapes=True)
def eval_crop_fn(x, y):
    return center_crop_layer(x), y
Perform some simple preprocessing/augmentations to the input data. These include rescaling to 0-1 and also scaling based on mean and stdev of the rgb colours on imagenet. Also, random
rescaling_layer = keras.layers.experimental.preprocessing.Rescaling(1./255)
train_preproc = keras.Sequential([
rescaling_layer
])
# from https://github.com/tensorflow/models/blob/master/official/vision/image_classification/preprocessing.py
# Calculated from the ImageNet training set
MEAN_RGB = (0.485 , 0.456, 0.406)
STDDEV_RGB = (0.229, 0.224, 0.225)
@tf.function
def z_score_scale(x):
    return (x - MEAN_RGB) / STDDEV_RGB
@tf.function
def train_preproc_fn(x, y):
    return z_score_scale(train_preproc(x)), y
@tf.function
def eval_preproc_fn(x, y):
    return z_score_scale(eval_preproc(x)), y
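As a worked example of the two preprocessing steps (rescale to 0-1, then standardize each channel), here is the arithmetic for a single RGB pixel in plain Python. It reuses the MEAN_RGB and STDDEV_RGB constants from above; the helper name and the pixel values are illustrative only.

```python
MEAN_RGB = (0.485, 0.456, 0.406)
STDDEV_RGB = (0.229, 0.224, 0.225)


def preprocess_pixel(pixel):
    """Rescale 0-255 channel values to 0-1, then z-score each channel."""
    rescaled = [value / 255.0 for value in pixel]
    return [(x - m) / s for x, m, s in zip(rescaled, MEAN_RGB, STDDEV_RGB)]


# White and black pixels show the range the standardized values can take.
print(preprocess_pixel((255, 255, 255)))
print(preprocess_pixel((0, 0, 0)))
```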
Input Pipeline
def get_input_pipeline(input_ds, bs, crop_fn, augmentation_fn):
    ret_ds = (
        input_ds
        .batch(1) # pre-crop are different dimensions and can't be batched
        .map(crop_fn,
             num_parallel_calls=tf.data.experimental.AUTOTUNE)
        .unbatch()
        .batch(bs)
        .map(augmentation_fn, # augmentations can be batched though.
             num_parallel_calls=tf.data.experimental.AUTOTUNE)
    )
    return ret_ds
# dataset loading
def load_imagenette():
    train_ds, ds_info = tfds.load('imagenette/320px', split='train', as_supervised=True, with_info=True)
    valid_ds = tfds.load('imagenette/320px', split='validation', as_supervised=True)
    return train_ds, valid_ds, valid_ds, ds_info.features['label'].num_classes
# pipeline construction
train_ds, valid_ds, test_ds, num_classes = load_imagenette()
# datasets used for training (notice that I use prefetch here)
train_samples = get_input_pipeline(train_ds, BS, train_crop_fn, train_preproc_fn).prefetch(tf.data.experimental.AUTOTUNE)
valid_samples = get_input_pipeline(valid_ds, BS, eval_crop_fn, eval_preproc_fn).prefetch(tf.data.experimental.AUTOTUNE)
test_samples = get_input_pipeline(test_ds, BS, eval_crop_fn, eval_preproc_fn).prefetch(tf.data.experimental.AUTOTUNE)
Training and Profiling
I use the TensorBoard profiler to profile the second batch, and I get a warning that the program is highly input bound, with about 40% of the processing time spent on input.
For a classic resnet18 model, you can drive the batch size up to 768 without getting an OOM error, which is what I use. A single step with bs 256 takes about 2-3 seconds.
I also get a warning that on_train_batch_end is slow, at around 1.5 seconds, compared to the 1s batch time.
The model training code is very simple keras:
model.fit(
    train_samples,
    validation_data=valid_samples,
    epochs=100,
    batch_size=BS,
    use_multiprocessing=True,
    callbacks=[tensorboard_callback, model_checkpoint_callback, early_stop_callback, reduce_lr_callback]
)
and the callbacks are specified as:
log_dir = os.path.join(os.getcwd(), 'logs')
tensorboard_callback = TensorBoard(log_dir=log_dir, update_freq="epoch", profile_batch=2)
reduce_lr_callback = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, min_lr=0.001, verbose=1)
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath='model.{epoch:02d}-{val_loss:.4f}.h5',
monitor='val_loss',
verbose=1,
save_best_only=True)
early_stop_callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=15)
Lastly, here are some sample tensorboard profiling screenshots. I can't figure out how to make this run faster:

How to calculate number of missing values summed over time dimension in a netcdf file in bash

I have a netcdf file with data as a function of lon,lat and time. I would like to calculate the total number of missing entries in each grid cell summed over the time dimension, preferably with CDO or NCO so I do not need to invoke R, python etc.
I know how to get the total number of missing values
ncap2 -s "nmiss=var.number_miss()" in.nc out.nc
as I answered to this related question:
count number of missing values in netcdf file - R
and CDO can tell me the total summed over space with
cdo info in.nc
but I can't work out how to sum over time. Is there a way for example of specifying the dimension to sum over with number_miss in ncap2?
We added the missing() function to ncap2 to solve this problem elegantly as of NCO 4.6.7 (May, 2017). To count missing values through time:
ncap2 -s 'mss_val=three_dmn_var_dbl.missing().ttl($time)' in.nc out.nc
Here ncap2 chains two methods together, missing(), followed by a total over the time dimension. The 2D variable mss_val is in out.nc. The response below does the same but averages over space and reports through time (because I misinterpreted the OP).
Old/obsolete answer:
There are two ways to do this with NCO/ncap2, though neither is as elegant as I would like. Either assemble the answer one record at a time by calling number_miss() on each record, or (my preference) use the boolean comparison function followed by the total operator along the axes of choice:
zender@aerosol:~$ ncap2 -O -s 'tmp=three_dmn_var_dbl;mss_val=tmp.get_miss();tmp.delete_miss();tmp_bool=(tmp==mss_val);tmp_bool_ttl=tmp_bool.ttl($lon,$lat);print(tmp_bool_ttl);' ~/nco/data/in.nc ~/foo.nc
tmp_bool_ttl[0]=0
tmp_bool_ttl[1]=0
tmp_bool_ttl[2]=0
tmp_bool_ttl[3]=8
tmp_bool_ttl[4]=0
tmp_bool_ttl[5]=0
tmp_bool_ttl[6]=0
tmp_bool_ttl[7]=1
tmp_bool_ttl[8]=0
tmp_bool_ttl[9]=2
or
zender@aerosol:~$ ncap2 -O -s 'for(rec=0;rec<time.size();rec++){nmiss=three_dmn_var_int(rec,:,:).number_miss();print(nmiss);}' ~/nco/data/in.nc ~/foo.nc
nmiss = 0
nmiss = 0
nmiss = 8
nmiss = 0
nmiss = 0
nmiss = 1
nmiss = 0
nmiss = 2
nmiss = 1
nmiss = 2
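The difference between the two reductions discussed in this answer (missing values per grid cell summed over time, versus per time step summed over space) may be easier to see on a tiny example. This plain-Python sketch uses None as a stand-in for the missing value and assumes a hypothetical data[time][lat][lon] layout.

```python
# Hypothetical 3 time steps x 2 lat x 2 lon field; None marks a missing value.
data = [
    [[1.0, None], [2.0, 3.0]],
    [[None, None], [4.0, 5.0]],
    [[6.0, 7.0], [None, 8.0]],
]

n_time, n_lat, n_lon = len(data), len(data[0]), len(data[0][0])

# Missing count per grid cell, summed over time (what the question asks for).
per_cell = [
    [sum(data[t][j][i] is None for t in range(n_time)) for i in range(n_lon)]
    for j in range(n_lat)
]

# Missing count per time step, summed over space (what the older answer reports).
per_step = [
    sum(value is None for row in data[t] for value in row) for t in range(n_time)
]

print(per_cell)  # [[1, 2], [1, 0]]
print(per_step)  # [1, 2, 1]
```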
Even though you are asking for another solution, I would like to show you that it takes only one very short line to find the answer with the help of Python. The variable m_data has exactly the same shape as a variable with missing values read using the netCDF4 package. With the execution of only one np.sum command with the correct axis specified, you have your answer.
import numpy as np
import matplotlib.pyplot as plt
import netCDF4 as nc4
# Generate random data for this experiment.
data = np.random.rand(365, 64, 128)
# Masked data, this is how the data is read from NetCDF by the netCDF4 package.
# For this example, I mask all values less than 0.1.
m_data = np.ma.masked_array(data, mask=data<0.1)
# It only takes one operation to find the answer.
n_values_missing = np.sum(m_data.mask, axis=0)
# Just a plot of the result.
plt.figure()
plt.pcolormesh(n_values_missing)
plt.colorbar()
plt.xlabel('lon')
plt.ylabel('lat')
plt.show()
# Save a netCDF file of the results.
f = nc4.Dataset('test.nc', 'w', format='NETCDF4')
f.createDimension('lon', 128)
f.createDimension('lat', 64 )
n_values_missing_nc = f.createVariable('n_values_missing', 'i4', ('lat', 'lon'))
n_values_missing_nc[:,:] = n_values_missing[:,:]
f.close()

Error in setting max features parameter in Isolation Forest algorithm using sklearn

I'm trying to train a dataset with 357 features using Isolation Forest sklearn implementation. I can successfully train and get results when the max features variable is set to 1.0 (the default value).
However when max features is set to 2, it gives the following error:
ValueError: Number of features of the model must match the input.
Model n_features is 2 and input n_features is 357
It also gives the same error when the feature count is 1 (int) and not 1.0 (float).
My understanding was that when the feature count is 2 (int), two features should be considered when creating each tree. Is this wrong? How can I change the max features parameter?
The code is as follows:
from sklearn.ensemble import IsolationForest
def isolation_forest_imp(dataset):
    estimators = 10
    samples = 100
    features = 2
    contamination = 0.1
    bootstrap = False
    random_state = None
    verbosity = 0
    estimator = IsolationForest(n_estimators=estimators, max_samples=samples, contamination=contamination,
                                max_features=features,
                                bootstrap=bootstrap, random_state=random_state, verbose=verbosity)
    model = estimator.fit(dataset)
In the documentation it states:
max_features : int or float, optional (default=1.0)
The number of features to draw from X to train each base estimator.
- If int, then draw `max_features` features.
- If float, then draw `max_features * X.shape[1]` features.
So, 2 should mean take two features and 1.0 should mean take all of the features, 0.5 take half and so on, from what I understand.
I think this could be a bug, since, taking a look in IsolationForest's fit:
# Isolation Forest inherits from BaseBagging
# and when _fit is called, BaseBagging takes care of the features correctly
super(IsolationForest, self)._fit(X, y, max_samples,
max_depth=max_depth,
sample_weight=sample_weight)
# however, when after _fit the decision_function is called using X - the whole sample - not taking into account the max_features
self.threshold_ = -sp.stats.scoreatpercentile(
-self.decision_function(X), 100. * (1. - self.contamination))
then:
# when the decision function _validate_X_predict is called, with X unmodified,
# it calls the base estimator's (dt) _validate_X_predict with the whole X
X = self.estimators_[0]._validate_X_predict(X, check_input=True)
...
# from tree.py:
def _validate_X_predict(self, X, check_input):
    """Validate X whenever one tries to predict, apply, predict_proba"""
    if self.tree_ is None:
        raise NotFittedError("Estimator not fitted, "
                             "call `fit` before exploiting the model.")
    if check_input:
        X = check_array(X, dtype=DTYPE, accept_sparse="csr")
        if issparse(X) and (X.indices.dtype != np.intc or
                            X.indptr.dtype != np.intc):
            raise ValueError("No support for np.int64 index based "
                             "sparse matrices")
    # so, this check fails because X is the original X, not with the max_features applied
    n_features = X.shape[1]
    if self.n_features_ != n_features:
        raise ValueError("Number of features of the model must "
                         "match the input. Model n_features is %s and "
                         "input n_features is %s "
                         % (self.n_features_, n_features))
    return X
So, I am not sure how you can handle this. Maybe figure out the fraction that leads to just the two features you need, even though I am not sure it will work as expected.
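The documented rule and the fraction workaround can be sketched in a few lines. The helper below is hypothetical and only mirrors the docstring quoted in the question (an int is a count, a float is a fraction of the columns); truncating with int() is an assumption about how the fraction is rounded, so check it against your installed version.

```python
def n_features_drawn(max_features, n_features):
    """Number of features per base estimator under the documented rule."""
    if isinstance(max_features, int):
        return max_features
    # Assumption: a float fraction is truncated to an integer count.
    return int(max_features * n_features)


n_features = 357
print(n_features_drawn(1.0, n_features))  # 357
print(n_features_drawn(2, n_features))    # 2

# Aiming at the middle of the [2, 3) range avoids float round-off
# pushing the product just below 2.
fraction = 2.5 / n_features
print(n_features_drawn(fraction, n_features))  # 2
```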
Note: I am using scikit-learn v.0.18
Edit: as @Vivek Kumar commented, this is a known issue, and upgrading to 0.20 should do the trick.

How to normalize an image using Octave?

In their paper describing Viola-Jones object detection framework (Robust Real-Time Face Detection by Viola and Jones), it is said:
All example sub-windows used for training were variance normalized to minimize the effect of different lighting conditions.
My question is "How to implement image normalization in Octave?"
I'm NOT looking for the specific implementation that Viola & Jones used, but a similar one that produces almost the same output. I've been following a lot of haar-training tutorials (trying to detect a hand) but have not yet been able to produce a good detector (xml).
I've tried contacting the authors, but still no response yet.
I already answered how to do it, in general terms, in this thread.
Here is how to do method 1 (normalizing to zero mean and unit standard deviation) in Octave, demonstrated on a random matrix A; it can of course be applied to any matrix, which is how the picture is represented:
>>A = rand(5,5)
A =
0.078558 0.856690 0.077673 0.038482 0.125593
0.272183 0.091885 0.495691 0.313981 0.198931
0.287203 0.779104 0.301254 0.118286 0.252514
0.508187 0.893055 0.797877 0.668184 0.402121
0.319055 0.245784 0.324384 0.519099 0.352954
>>s = std(A(:))
s = 0.25628
>>u = mean(A(:))
u = 0.37275
>>A_norm = (A - u) / s
A_norm =
-1.147939 1.888350 -1.151395 -1.304320 -0.964411
-0.392411 -1.095939 0.479722 -0.229316 -0.678241
-0.333804 1.585607 -0.278976 -0.992922 -0.469159
0.528481 2.030247 1.658861 1.152795 0.114610
-0.209517 -0.495419 -0.188723 0.571062 -0.077241
In the above you use:
To get the standard deviation of the matrix: s = std(A(:))
To get the mean value of the matrix: u = mean(A(:))
And then following the formula A'[i][j] = (A[i][j] - u)/s with the
vectorized version: A_norm = (A - u) / s
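For readers who want to cross-check the result outside Octave, the same variance normalization can be written with Python's standard library. This is a sketch on a small made-up matrix; statistics.stdev uses the N-1 denominator, which matches Octave's default std.

```python
import statistics

A = [[1.0, 2.0], [3.0, 6.0]]  # made-up 2x2 "image"

flat = [value for row in A for value in row]  # same role as A(:)
u = statistics.mean(flat)
s = statistics.stdev(flat)  # sample standard deviation (N-1 denominator)

A_norm = [[(value - u) / s for value in row] for row in A]

# After normalization the values have zero mean and unit standard deviation.
flat_norm = [value for row in A_norm for value in row]
print(round(statistics.mean(flat_norm), 12))
print(round(statistics.stdev(flat_norm), 12))
```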
Normalizing it with vector normalization is also simple:
>>abs = sqrt((A(:))' * (A(:)))
abs = 2.2472
>>A_norm = A / abs
A_norm =
0.034959 0.381229 0.034565 0.017124 0.055889
0.121122 0.040889 0.220583 0.139722 0.088525
0.127806 0.346703 0.134059 0.052637 0.112369
0.226144 0.397411 0.355057 0.297343 0.178945
0.141980 0.109375 0.144351 0.231000 0.157065
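In the above:
abs is the Euclidean length of the matrix treated as a vector, which is calculated with a vectorized multiplication (A(:)' * A(:) is actually sum(A[i][j]^2)). Note that naming the variable abs shadows Octave's built-in abs function, so a different name is safer in real code.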
Then we use it to normalize the vector so it will be of length 1.
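The same unit-length normalization, as a plain-Python sketch on a made-up matrix whose length is easy to verify by hand (3-4-5 triangle):

```python
import math

A = [[3.0, 0.0], [0.0, 4.0]]  # made-up matrix

# Euclidean length of the flattened matrix: sqrt of the sum of squares.
length = math.sqrt(sum(value * value for row in A for value in row))
A_unit = [[value / length for value in row] for row in A]

print(length)  # 5.0
print(A_unit)  # [[0.6, 0.0], [0.0, 0.8]]
```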
