I am getting error mentioned in the title and didn't find yet a solution.
X = train[feats].values
y = train['Target'].values
cv = StratifiedKFold(n_splits=3, random_state=2021, shuffle=True)
model = LogisticRegression(solver='liblinear')
scores = []
for train_idx, test_idx in cv.split(X, y):
model.fit(X[train_idx], y[train_idx])
y_pred = model.predict(X[test_idx])
score = mean_absolute_error(y[test_idx], y_pred )
scores.append(score)
print(np.mean(scores), np.std(scores))
fig = plt.figure(figsize=(15,6));
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
skplt.metrics.plot_confusion_matrix(y, y_pred, ax = ax1) #error line
skplt.metrics.plot_roc(y, y_pred, ax = ax2)
ValueError: Found input variables with inconsistent numbers of samples: [32561, 10853]
I checked the code, read many threads on this error. Somebody suggested me as a solution to put the cross-validation in a loop, but I don't know how to manage this with code (and also which part of operation to put in a loop, and how to write a condition that should be ending this loop). Please, help me with a specific answer that will help me to easily fix problem with my current level of advancement.
I'm trying to learn more about pyhf and my understanding of what the goals are might be limited. I would love to fit my HEP data outside of ROOT, but I could be imposing expectations on pyhf which are not what the authors intended for it's use.
I'd like to write myself a hello-world example, but I might just not know what I'm doing. My misunderstanding could also be gaps in my statistical knowledge.
With that preface, let me explain what I'm trying to explore.
I have some observed set of events for which I calculate some observable and make a binned histogram of that data. I hypothesize that there are two contributing physics processes, which I call signal and background. I generate some Monte Carlo samples for these processes and the theorized total number of events is close to, but not exactly what I observe.
I would like to:
Fit the data to this two process hypothesis
Get from the fit the optimal values for the number of events for each process
Get the uncertainties on these fitted values
If appropriate, calculate an upper limit on the number of signal events.
My starter code is below, where all I'm doing is an ML fit but I'm not sure where to go. I know it's not set up to do what I want, but I'm getting lost in the examples I find on RTD. I'm sure it's me, this is not a criticism of the documentation.
import pyhf
import numpy as np
import matplotlib.pyplot as plt
nbins = 15
# Generate a background and signal MC sample`
MC_signal_events = np.random.normal(5,1.0,200)
MC_background_events = 10*np.random.random(1000)
signal_data = np.histogram(MC_signal_events,bins=nbins)[0]
bkg_data = np.histogram(MC_background_events,bins=nbins)[0]
# Generate an observed dataset with a slightly different
# number of events
signal_events = np.random.normal(5,1.0,180)
background_events = 10*np.random.random(1050)
observed_events = np.array(signal_events.tolist() + background_events.tolist())
observed_sample = np.histogram(observed_events,bins=nbins)[0]
# Plot these samples, if you like
plt.figure(figsize=(12,4))
plt.subplot(1,3,1)
plt.hist(observed_events,bins=nbins,label='Observations')
plt.legend()
plt.subplot(1,3,2)
plt.hist(MC_signal_events,bins=nbins,label='MC signal')
plt.legend()
plt.subplot(1,3,3)
plt.hist(MC_background_events,bins=nbins,label='MC background')
plt.legend()
# Use a very naive estimate of the background
# uncertainties
bkg_uncerts = np.sqrt(bkg_data)
print("Defining the PDF.......")
pdf = pyhf.simplemodels.hepdata_like(signal_data=signal_data.tolist(), \
bkg_data=bkg_data.tolist(), \
bkg_uncerts=bkg_uncerts.tolist())
print("Fit.......")
data = pyhf.tensorlib.astensor(observed_sample.tolist() + pdf.config.auxdata)
bestfit_pars, twice_nll = pyhf.infer.mle.fit(data, pdf, return_fitted_val=True)
print(bestfit_pars)
print(twice_nll)
plt.show()
Note: this answer is based on pyhf v0.5.2.
Alright, so it looks like you've managed to figure most of the big pieces for sure. However, there's two different ways to do this depending on how you prefer to set things up. In both cases, I assume you want an unconstrained fit and you want to...
fit your signal+background model to observed data
fit your background model to observed data
First, let's discuss uncertainties briefly. At the moment, we default to numpy for the tensor background and scipy for the optimizer. See documentation:
numpy backend
scipy optimizer
However, one unfortunate drawback right now with the scipy optimizer is that it cannot return the uncertainties. What you need to do anywhere in your code before the fit (although we generally recommend as early as possible) is to use the minuit optimizer instead:
pyhf.set_backend('numpy', 'minuit')
This will get you the nice features of being able to get the correlation matrix, the uncertainties on the fitted parameters, and the hessian -- amongst other things. We're working to make this consistent for scipy as well, but this is not ready right now.
All optimizations go through our optimizer API which you can currently view through the mixin here in our documentation. Specifically, the signature is
minimize(
objective,
data,
pdf,
init_pars,
par_bounds,
fixed_vals=None,
return_fitted_val=False,
return_result_obj=False,
do_grad=None,
do_stitch=False,
**kwargs)
There are a lot of options here. Let's just focus on the fact that one of the keyword arguments we can pass through is return_uncertainties which will change the bestfit parameters by adding a column for the fitted parameter uncertainty which you want.
1. Signal+Background
In this case, we want to just use the default model
result, twice_nll = pyhf.infer.mle.fit(
data,
pdf,
return_uncertainties=True,
return_fitted_val=True
)
bestfit_pars, errors = result.T
2. Background-Only
In this case, we need to turn off the signal. The way we do this is by setting the parameter of interest (POI) fixed to 0.0. Then we can get the fitted parameters for the background-only model in a similar way, but using fixed_poi_fit instead of an unconstrained fit:
result, twice_nll = pyhf.infer.mle.fixed_poi_fit(
0.0,
data,
pdf,
return_uncertainties=True,
return_fitted_val=True
)
bestfit_pars, errors = result.T
Note that this is quite simply a quick way of doing the following unconstrained fit
bkg_params = pdf.config.suggested_init()
fixed_params = pdf.config.suggested_fixed()
bkg_params[pdf.config.poi_index] = 0.0
fixed_params[pdf.config.poi_index] = True
result, twice_nll = pyhf.infer.mle.fit(
data,
pdf,
init_pars=bkg_params,
fixed_params=fixed_params,
return_uncertainties=True,
return_fitted_val=True
)
bestfit_pars, errors = result.T
Hopefully that clarifies things up more!
Giordon's solution should answer all of your question, but I thought I'd also write out the code to basically address everything we can.
I also take the liberty of changing some of your values a bit so that the signal isn't so strong that the observed CLs value isn't far off to the right of the Brazil band (the results aren't wrong obviously, but it probably makes more sense to be talking about using the discovery test statistic at that point then setting limits. :))
Environment
For this example I'm going to setup a clean Python 3 virtual environment and then install the dependencies (here we're going to be using pyhf v0.5.2)
$ python3 -m venv "${HOME}/.venvs/question"
$ . "${HOME}/.venvs/question/bin/activate"
(question) $ cat requirements.txt
pyhf[minuit,contrib]~=0.5.2
black
(question) $ python -m pip install -r requirements.txt
Code
While we can't easily get the best fit value for both the number of signal events as well as the background events we definitely can do inference to get the best fit value for the signal strength.
The following chunk of code (which is long only because of the visualization) should address all of the points of your question.
# answer.py
import numpy as np
import pyhf
import matplotlib.pyplot as plt
import pyhf.contrib.viz.brazil
# Goals:
# - Fit the model to the observed data
# - Infer the best fit signal strength given the model
# - Get the uncertainties on the best fit signal strength
# - Calculate an 95% CL upper limit on the signal strength
def plot_hist(ax, bins, data, bottom=0, color=None, label=None):
bin_width = bins[1] - bins[0]
bin_leftedges = bins[:-1]
bin_centers = [edge + bin_width / 2.0 for edge in bin_leftedges]
ax.bar(
bin_centers, data, bin_width, bottom=bottom, alpha=0.5, color=color, label=label
)
def plot_data(ax, bins, data, label="Data"):
bin_width = bins[1] - bins[0]
bin_leftedges = bins[:-1]
bin_centers = [edge + bin_width / 2.0 for edge in bin_leftedges]
ax.scatter(bin_centers, data, color="black", label=label)
def invert_interval(test_mus, hypo_tests, test_size=0.05):
# This will be taken care of in v0.5.3
cls_obs = np.array([test[0] for test in hypo_tests]).flatten()
cls_exp = [
np.array([test[1][idx] for test in hypo_tests]).flatten() for idx in range(5)
]
crossing_test_stats = {"exp": [], "obs": None}
for cls_exp_sigma in cls_exp:
crossing_test_stats["exp"].append(
np.interp(
test_size, list(reversed(cls_exp_sigma)), list(reversed(test_mus))
)
)
crossing_test_stats["obs"] = np.interp(
test_size, list(reversed(cls_obs)), list(reversed(test_mus))
)
return crossing_test_stats
def main():
np.random.seed(0)
pyhf.set_backend("numpy", "minuit")
observable_range = [0.0, 10.0]
bin_width = 0.5
_bins = np.arange(observable_range[0], observable_range[1] + bin_width, bin_width)
n_bkg = 2000
n_signal = int(np.sqrt(n_bkg))
# Generate simulation
bkg_simulation = 10 * np.random.random(n_bkg)
signal_simulation = np.random.normal(5, 1.0, n_signal)
bkg_sample, _ = np.histogram(bkg_simulation, bins=_bins)
signal_sample, _ = np.histogram(signal_simulation, bins=_bins)
# Generate observations
signal_events = np.random.normal(5, 1.0, int(n_signal * 0.8))
bkg_events = 10 * np.random.random(int(n_bkg + np.sqrt(n_bkg)))
observed_events = np.array(signal_events.tolist() + bkg_events.tolist())
observed_sample, _ = np.histogram(observed_events, bins=_bins)
# Visualize the simulation and observations
fig, ax = plt.subplots()
fig.set_size_inches(7, 5)
plot_hist(ax, _bins, bkg_sample, label="Background")
plot_hist(ax, _bins, signal_sample, bottom=bkg_sample, label="Signal")
plot_data(ax, _bins, observed_sample)
ax.legend(loc="best")
ax.set_ylim(top=np.max(observed_sample) * 1.4)
ax.set_xlabel("Observable")
ax.set_ylabel("Count")
fig.savefig("components.png")
# Build the model
bkg_uncerts = np.sqrt(bkg_sample)
model = pyhf.simplemodels.hepdata_like(
signal_data=signal_sample.tolist(),
bkg_data=bkg_sample.tolist(),
bkg_uncerts=bkg_uncerts.tolist(),
)
data = pyhf.tensorlib.astensor(observed_sample.tolist() + model.config.auxdata)
# Perform inference
fit_result = pyhf.infer.mle.fit(data, model, return_uncertainties=True)
bestfit_pars, par_uncerts = fit_result.T
print(
f"best fit parameters:\
\n * signal strength: {bestfit_pars[0]} +/- {par_uncerts[0]}\
\n * nuisance parameters: {bestfit_pars[1:]}\
\n * nuisance parameter uncertainties: {par_uncerts[1:]}"
)
# Perform hypothesis test scan
_start = 0.0
_stop = 5
_step = 0.1
poi_tests = np.arange(_start, _stop + _step, _step)
print("\nPerforming hypothesis tests\n")
hypo_tests = [
pyhf.infer.hypotest(
mu_test,
data,
model,
return_expected_set=True,
return_test_statistics=True,
qtilde=True,
)
for mu_test in poi_tests
]
# Upper limits on signal strength
results = invert_interval(poi_tests, hypo_tests)
print(f"Observed Limit on µ: {results['obs']:.2f}")
print("-----")
for idx, n_sigma in enumerate(np.arange(-2, 3)):
print(
"Expected {}Limit on µ: {:.3f}".format(
" " if n_sigma == 0 else "({} σ) ".format(n_sigma),
results["exp"][idx],
)
)
# Visualize the "Brazil band"
fig, ax = plt.subplots()
fig.set_size_inches(7, 5)
ax.set_title("Hypothesis Tests")
ax.set_ylabel(r"$\mathrm{CL}_{s}$")
ax.set_xlabel(r"$\mu$")
pyhf.contrib.viz.brazil.plot_results(ax, poi_tests, hypo_tests)
fig.savefig("brazil_band.png")
if __name__ == "__main__":
main()
which when run gives
(question) $ python answer.py
best fit parameters:
* signal strength: 1.5884737977889158 +/- 0.7803435235862329
* nuisance parameters: [0.99020988 1.06040191 0.90488207 1.03531383 1.09093327 1.00942088
1.07789316 1.01125627 1.06202964 0.95780043 0.94990993 1.04893286
1.0560711 0.9758487 0.93692481 1.04683181 1.05785515 0.92381263
0.93812855 0.96751869]
* nuisance parameter uncertainties: [0.06966439 0.07632218 0.0611428 0.07230328 0.07872258 0.06899675
0.07472849 0.07403246 0.07613661 0.08606657 0.08002775 0.08655314
0.07564512 0.07308117 0.06743479 0.07383134 0.07460864 0.06632003
0.06683251 0.06270965]
Performing hypothesis tests
/home/stackoverflow/.venvs/question/lib/python3.7/site-packages/pyhf/infer/calculators.py:229: RuntimeWarning: invalid value encountered in double_scalars
teststat = (qmu - qmu_A) / (2 * self.sqrtqmuA_v)
Observed Limit on µ: 2.89
-----
Expected (-2 σ) Limit on µ: 0.829
Expected (-1 σ) Limit on µ: 1.110
Expected Limit on µ: 1.542
Expected (1 σ) Limit on µ: 2.147
Expected (2 σ) Limit on µ: 2.882
Let us know if you have any further questions!
I have read a lot about the pain of replicate the easy robust option from STATA to R to use robust standard errors. I replicated following approaches: StackExchange and Economic Theory Blog. They work but the problem I face is, if I want to print my results using the stargazer function (this prints the .tex code for Latex files).
Here is the illustration to my problem:
reg1 <-lm(rev~id + source + listed + country , data=data2_rev)
stargazer(reg1)
This prints the R output as .tex code (non-robust SE) If i want to use robust SE, i can do it with the sandwich package as follow:
vcov <- vcovHC(reg1, "HC1")
if I now use stargazer(vcov) only the output of the vcovHC function is printed and not the regression output itself.
With the package lmtest() it is possible to print at least the estimator, but not the observations, R2, adj. R2, Residual, Residual St.Error and the F-Statistics.
lmtest::coeftest(reg1, vcov. = sandwich::vcovHC(reg1, type = 'HC1'))
This gives the following output:
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.54923 6.85521 -0.3719 0.710611
id 0.39634 0.12376 3.2026 0.001722 **
source 1.48164 4.20183 0.3526 0.724960
country -4.00398 4.00256 -1.0004 0.319041
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How can I add or get an output with the following parameters as well?
Residual standard error: 17.43 on 127 degrees of freedom
Multiple R-squared: 0.09676, Adjusted R-squared: 0.07543
F-statistic: 4.535 on 3 and 127 DF, p-value: 0.00469
Did anybody face the same problem and can help me out?
How can I use robust standard errors in the lm function and apply the stargazer function?
You already calculated robust standard errors, and there's an easy way to include it in the stargazeroutput:
library("sandwich")
library("plm")
library("stargazer")
data("Produc", package = "plm")
# Regression
model <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
data = Produc,
index = c("state","year"),
method="pooling")
# Adjust standard errors
cov1 <- vcovHC(model, type = "HC1")
robust_se <- sqrt(diag(cov1))
# Stargazer output (with and without RSE)
stargazer(model, model, type = "text",
se = list(NULL, robust_se))
Solution found here: https://www.jakeruss.com/cheatsheets/stargazer/#robust-standard-errors-replicating-statas-robust-option
Update I'm not so much into F-Tests. People are discussing those issues, e.g. https://stats.stackexchange.com/questions/93787/f-test-formula-under-robust-standard-error
When you follow http://www3.grips.ac.jp/~yamanota/Lecture_Note_9_Heteroskedasticity
"A heteroskedasticity-robust t statistic can be obtained by dividing an OSL estimator by its robust standard error (for zero null hypotheses). The usual F-statistic, however, is invalid. Instead, we need to use the heteroskedasticity-robust Wald statistic."
and use a Wald statistic here?
This is a fairly simple solution using coeftest:
reg1 <-lm(rev~id + source + listed + country , data=data2_rev)
cl_robust <- coeftest(reg1, vcov = vcovCL, type = "HC1", cluster = ~
country)
se_robust <- cl_robust[, 2]
stargazer(reg1, reg1, cl_robust, se = list(NULL, se_robust, NULL))
Note that I only included cl_robust in the output as a verification that the results are identical.
I have to deal with very big data (Point clouds generally more than 30 000 000 points) using Matlab. I can read ascii data using textscan function. After reading, I need to detect invalid data (points with 0,0,0 coordinates) and then I need to do some mathematical operations on each point or each line in the data. In my way, first I read data with textscan and then I assign this data to a matrix. Secondly, I use for loops for detecting invalid points and doing some mathematical operations on each point or line in the data. A sample of my code is shown as below. According to profile tool of Matlab textscan takes 37% and line
transformed_list((i:i),(1:4)) = coordinate_list((i:i),(1:4))*t_matrix;
takes 35% of all computation time.
I tried it with another point cloud (stores around 5 500 000) and profile tool reported same results. Is there a way of avoiding for loops, or is there another way of speeding up this computation?
fileID = fopen('C:\Users\Mustafa\Desktop\ptx_all_data\dede5.ptx');
original_data = textscan(fileID,'%f %f %f %f %f %f %f', 'delimiter',' ');
fclose(fileID);
column = original_data{1}(1);
row = original_data{1}(2);
t_matrix = [original_data{1}(7) original_data{2}(7) original_data{3}(7) original_data{4}(7)
original_data{1}(8) original_data{2}(8) original_data{3}(8) original_data{4}(8)
original_data{1}(9) original_data{2}(9) original_data{3}(9) original_data{4}(9)
original_data{1}(10) original_data{2}(10) original_data{3}(10) original_data{4}(10)];
coordinate_list(:,1) = original_data{1}(11:length(original_data{1}));
coordinate_list(:,2) = original_data{2}(11:length(original_data{2}));
coordinate_list(:,3) = original_data{3}(11:length(original_data{3}));
coordinate_list(:,4) = 0;
coordinate_list(:,5) = original_data{4}(11:length(original_data{4}));
transformed_list = zeros(length(coordinate_list),5);
for i = 1:length(coordinate_list)
if coordinate_list(i,1) == 0 && coordinate_list(i,2) == 0 && coordinate_list(i,3) == 0
transformed_list(i,:) = NaN;
else
%transformed_list(i,:) = coordinate_list(i,:)*t_matrix;
transformed_list((i:i),(1:4)) = coordinate_list((i:i),(1:4))*t_matrix;
transformed_list(i,5) = coordinate_list(i,5);
end
%i
end
Thanks in advance
for loops with conditional statements like those will take ages to run. But what Matlab lacks in loop speed it makes up with vectorization and indexing.
Let's try some logical indexing like this to solve the first step:
coordinate_list(coordinate_list(:,1) == 0 .* ...
coordinate_list(:,2) == 0 .* ...
coordinate_list(:,3) == 0)=nan;
And then vectorize the second statement:
transformed_list(:,(1:4)) = coordinate_list(:,(1:4))*t_matrix;
As EBH mentioned above this might be a bit heavy on your RAM. If it's more than your computer can handle asks yourself if the coordinates really have to be doubles, maybe single precision will do. If that still doesn't do, try slicing the vector and performing the operation in parts.
Small example to give you an idea because I had a 2million element point cloud around here:
In R2015a
transformed_list = zeros(length(coordinate_list),5);
tic
for i = 1:length(coordinate_list)
if coordinate_list(i,1) == 0 && coordinate_list(i,2) == 0 && coordinate_list(i,3) == 0
transformed_list(i,:) = NaN;
else
%transformed_list(i,:) = coordinate_list(i,:)*t_matrix;
transformed_list((i:i),(1:3)) = coordinate_list((i:i),(1:3))*t_matrix;
transformed_list(i,5) = 1;
end
%i
end
toc
Returns Elapsed time is 10.928142 seconds.
transformed_list=coordinate_list;
tic
coordinate_list(coordinate_list(:,1) == 0 .* ...
coordinate_list(:,2) == 0 .* ...
coordinate_list(:,3) == 0)=nan;
transformed_list(:,(1:3)) = coordinate_list(:,(1:3))*t_matrix;
toc
Returns Elapsed time is 0.101696 seconds.
Rather than read the whole file, you'd be better off using a loop with
fscanf(fileID, '%f', 7)
and processing input as you read it.
In their paper describing Viola-Jones object detection framework (Robust Real-Time Face Detection by Viola and Jones), it is said:
All example sub-windows used for training were variance normalized to minimize the effect of different lighting conditions.
My question is "How to implement image normalization in Octave?"
I'm NOT looking for the specific implementation that Viola & Jones used but a similar one that produces almost the same output. I've been following a lot of haar-training tutorials(trying to detect a hand) but not yet able to output a good detector(xml).
I've tried contacting the authors, but still no response yet.
I already answered how to to it in general guidelines in this thread.
Here is how to do method 1 (normalizing to standard normal deviation) in octave (Demonstrating for a random matrix A, of course can be applied to any matrix, which is how the picture is represented):
>>A = rand(5,5)
A =
0.078558 0.856690 0.077673 0.038482 0.125593
0.272183 0.091885 0.495691 0.313981 0.198931
0.287203 0.779104 0.301254 0.118286 0.252514
0.508187 0.893055 0.797877 0.668184 0.402121
0.319055 0.245784 0.324384 0.519099 0.352954
>>s = std(A(:))
s = 0.25628
>>u = mean(A(:))
u = 0.37275
>>A_norn = (A - u) / s
A_norn =
-1.147939 1.888350 -1.151395 -1.304320 -0.964411
-0.392411 -1.095939 0.479722 -0.229316 -0.678241
-0.333804 1.585607 -0.278976 -0.992922 -0.469159
0.528481 2.030247 1.658861 1.152795 0.114610
-0.209517 -0.495419 -0.188723 0.571062 -0.077241
In the above you use:
To get the standard deviation of the matrix: s = std(A(:))
To get the mean value of the matrix: u = mean(A(:))
And then following the formula A'[i][j] = (A[i][j] - u)/s with the
vectorized version: A_norm = (A - u) / s
Normalizing it with vector normalization is also simple:
>>abs = sqrt((A(:))' * (A(:)))
abs = 2.2472
>>A_norm = A / abs
A_norm =
0.034959 0.381229 0.034565 0.017124 0.055889
0.121122 0.040889 0.220583 0.139722 0.088525
0.127806 0.346703 0.134059 0.052637 0.112369
0.226144 0.397411 0.355057 0.297343 0.178945
0.141980 0.109375 0.144351 0.231000 0.157065
In the abvove:
abs is the absolute value of the vector (its length), which is calculated with vectorized multiplications (A(:)' * A(:) is actually sum(A[i][j]^2))
Then we use it to normalize the vector so it will be of length 1.