I have set up a training data set (a DMatrix) for which I want to perform k-fold cross-validation.
train_data = xgb.DMatrix(data = X_Train, label = y_Train, weight = training_weights)
My question is: when I run a model using xgb.cv(), will each fold have the same total weight, or the same number of observations?
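One way to see the difference between the two possibilities is to split some made-up weights the way a plain k-fold split would. The sketch below is pure Python and does not call xgboost. It assumes folds are built by shuffling the row indices and cutting them into equal-sized chunks, which is my understanding of the default behaviour but should be checked against the xgboost version in use. The helper name fold_weight_sums and the example weights are hypothetical.

```python
import random

def fold_weight_sums(weights, nfold, seed=0):
    """Shuffle row indices, split them into nfold equal-sized chunks,
    and return (row count, total weight) for each chunk."""
    rng = random.Random(seed)
    idx = list(range(len(weights)))
    rng.shuffle(idx)
    # every nfold-th shuffled index goes to the same fold
    folds = [idx[i::nfold] for i in range(nfold)]
    return [(len(f), sum(weights[i] for i in f)) for f in folds]

# made-up weights: 90 light observations and 10 heavy ones
weights = [1.0] * 90 + [10.0] * 10
summary = fold_weight_sums(weights, nfold=5)
for n_rows, total_weight in summary:
    print(n_rows, total_weight)
```

Under that assumption every fold has the same number of rows, while the total weight per fold depends on where the heavy observations happen to land. If equal weight per fold matters, the folds can be built by hand and passed in explicitly.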
So I'm working on a project and I have a set of data that I loaded in as a csv. The data has a region that I need to flatten out. I used the numpy.polyfit() function to find a line of best fit, but what I can't seem to figure out is how to subtract off the best fit line. Any advice?
Here is the code I'm using so far:
μ = pd.read_csv("C:\\Users\\ander\\Documents\\Data\\plots and code\\dataframe2.csv")
yvalue = "average"
xvalue = "xvalue"
X = μ[xvalue][173:852]
Y = μ[yvalue][173:852]
fit = np.polyfit(X, Y, 1)
μ = μ.subtract(fit, μ)
The polyfit function returns the coefficients of the best-fit polynomial (for degree 1, the slope and the intercept). To subtract the line from your data, you first need to build the linear function itself, for example with numpy.poly1d.
I'll show you an example. Since we don't have access to the .csv file, I made up X and Y:
import matplotlib.pyplot as plt
import numpy as np
DATA_SIZE = 500
μ_X = np.sort(np.random.uniform(0,10,DATA_SIZE))
μ_Y = 3*np.exp(-(μ_X-7)**2) + np.random.normal(0,0.08,DATA_SIZE) + 0.5*μ_X
X = μ_X[50:200]
Y = μ_Y[50:200]
plt.scatter(μ_X, μ_Y, label='Full data')
plt.scatter(X, Y, label='Selected region')
plt.legend()
plt.show()
Now we can fit the baseline from the orange data and subtract the linear function from all the data (blue).
fit = np.polyfit(X, Y, 1)
linear_baseline = np.poly1d(fit) # create the linear baseline function
μ_Y = μ_Y - linear_baseline(μ_X) # subtract the baseline from μ_Y
plt.scatter(μ_X, μ_Y, label='Linear baseline removed')
plt.legend()
plt.show()
I would like to calculate the sensitivity-specificity sum maximization threshold (Youden Index) for my glm model:
model = glm(present ~ Summer_precipitation + Summer_temperature + Frost_days + Snowcover_days + Forest_presence + Population_density + Tick_density + Vaccination_coverage, family = binomial(link = "logit"), data = tbe_data)
I calculated predicted probabilities for the model. "Weather data" is a stacked raster file of all the covariate rasters listed in the model above.
#create predictions based on weather data
predictions=predict(weather_data,model,type="response")
#plot predictions
plot(predictions)
How can I now calculate the optimal probability cutoff point from the model? I would use the "cutpointr" function, but I don't know how to adapt the code to my situation.
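For reference, the quantity being maximized is Youden's J = sensitivity + specificity - 1, evaluated at each candidate cutoff. The sketch below shows that calculation in plain Python on made-up data. It is not cutpointr, and it assumes you have observed presence/absence labels paired with the model's predicted probabilities at the same locations. The function name youden_threshold and the example values are hypothetical.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing sensitivity + specificity - 1.
    A case is classified as positive when its probability is >= threshold."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best_t, best_j = None, -1.0
    for t in sorted(set(y_prob)):  # every observed probability is a candidate
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:  # ties keep the lowest threshold
            best_t, best_j = t, j
    return best_t, best_j

# made-up observations: 1 = present, 0 = absent
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.70, 0.80, 0.95]
threshold, j = youden_threshold(y_true, y_prob)
print(threshold, j)
```

In R the same inputs would be the observed "present" column and the fitted probabilities for those rows, not the raster predictions, because the raster cells have no known outcome to compare against.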
I have a dataset where I am trying to predict the type of car based on a number of features. I would like to run an OLS regression to see
import statsmodels.api as sm
X = features
# where 0 = sedan, 1 = minivan , etc
y = [0,0,1,0,2,....]
X2 = sm.add_constant(np.array(X))
est = sm.OLS(np.array(y), X2)
est2 = est.fit()
I don't think this is correct, because I am not specifying that the outcome is categorical, and I feel the functional form should change. Does anyone have any insight on this?
Ordinary least squares regression assumes a numerical dependent variable, so it is not suitable for predicting categorical outcomes. Coding the car types as 0, 1, 2 would also impose an ordering and spacing on them that does not exist.
To predict categorical outcomes with a regression model, you want to use multinomial logistic regression, for example using sklearn.
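A minimal sketch of that suggestion, assuming scikit-learn and NumPy are installed. The data is made up (three classes standing in for car types, two numeric features); with a multi-class target, LogisticRegression with its default solver fits a multinomial model in recent scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# made-up data: 3 car types (0 = sedan, 1 = minivan, 2 = other), 2 features
n_per_class = 50
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(c, 1.0, size=(n_per_class, 2)) for c in centers])
y = np.repeat([0, 1, 2], n_per_class)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

proba = clf.predict_proba(X)  # one probability per class, rows sum to 1
pred = clf.predict(X)         # most probable class for each row
accuracy = (pred == y).mean()
print(proba[:3])
print(accuracy)
```

If you need a statsmodels-style coefficient table instead of predictions, statsmodels also provides a multinomial logit model (sm.MNLogit), which takes the same kind of integer-coded outcome.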
I simulated an ARMA Process and tried to forecast it with statsmodels.
I plotted the true value and the forecasted values.
I read that out-of-sample forecasts tend to converge to the sample mean over a long forecasting horizon. Can someone describe how these forecasts are calculated? I read in the documentation that the ARMA model is transformed into a state space model and the next value is then forecast via the Kalman filter. Is this correct? Is the calculated value for t+1 then used for the prediction of t+2?
However, if I create another AR process, shift it upwards, interleave the two arrays, and then forecast again, the forecast follows the alternating pattern.
How does statsmodels ARIMA forecast "learn" the ups and downs of the data?
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima_model import ARIMA
ar = np.array([1, -0.9])
ma = np.array([1])
AR_object = ArmaProcess(ar, ma)
a = AR_object.generate_sample(nsample=200)
b = AR_object.generate_sample(nsample=200)+10
ab = [item for sublist in zip(a, b) for item in sublist]
train = a[:100]
test = a[100:]
train_ab = ab[:100]
test_ab = ab[100:200]
plt.plot(a)
plt.plot(b)
model = ARIMA(train, order=(1,0,1))
model_fit = model.fit()
forecast = model_fit.forecast(steps=len(test))[0]
plt.figure()
plt.plot(test, color= 'green')
plt.plot(forecast, color='red')
plt.figure()
model = ARIMA(train_ab, order=(1,0,1))
model_fit = model.fit()
forecast_ab = model_fit.forecast(steps=len(test_ab))[0]
plt.plot(test_ab, color= 'green')
plt.plot(forecast_ab, color='red')
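On the first part of the question, the convergence to the mean can be seen without statsmodels. The sketch below is not how statsmodels computes its forecasts internally; it is the textbook recursion for an AR(1) process with a known coefficient phi and mean, where each forecast is fed back in as the input for the next step. The function name and the numbers are made up for illustration.

```python
def ar1_forecast(last_value, phi, mean, steps):
    """Recursive multi-step forecast for an AR(1) process:
    x_hat[t+1] = mean + phi * (x_hat[t] - mean), starting from the last observation."""
    forecasts = []
    prev = last_value
    for _ in range(steps):
        prev = mean + phi * (prev - mean)  # the previous forecast is reused as input
        forecasts.append(prev)
    return forecasts

# same AR coefficient as in the simulation above (0.9), zero mean
forecasts = ar1_forecast(last_value=5.0, phi=0.9, mean=0.0, steps=100)
print(forecasts[:5])
print(forecasts[-1])
```

Because future shocks have expected value zero, the h-step forecast is mean + phi**h * (last_value - mean), which shrinks geometrically toward the mean whenever |phi| < 1. The interleaved series is a different case: alternating between two levels produces a strong negative lag-1 correlation, and a fitted AR coefficient close to -1 gives forecasts that keep flipping sign around the mean for a long time.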
I know H2O can use
model_perf = model.model_performance(input)
model_perf.confusion_matrix
to output the confusion matrix. But is there a way to get the confusion matrix table to create plot?
You already have the function you need, as indicated here. You just need to convert the output of your H2OFrames to a pandas DataFrame. An example is shown below:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels
%matplotlib inline
h2o.init()
h2o.cluster().show_status()
# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# print(cars["economy_20mpg"].isna().sum())
cars[~cars["economy_20mpg"].isna()]["economy_20mpg"].isna().sum()
cars = cars[~cars["economy_20mpg"].isna()]
# convert response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"
# split into train and validation sets
train, valid = cars.split_frame(ratios = [.8], seed = 1234)
# try using the `y` parameter:
# first initialize your estimator
cars_gbm = H2OGradientBoostingEstimator(seed = 1234, sample_rate=.5)
# then train your model, where you specify your 'x' predictors, your 'y' the response column
# training_frame and validation_frame
cars_gbm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
Plotting function from sklearn:
def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'
    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')
    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")
    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax
Extract the values:
# specify the threshold you want to use to create integer labels
maxf1_threshold = cars_gbm.find_threshold_by_max_metric('f1')
# specify your true and predicted labels
y_true = cars["economy_20mpg"].as_data_frame()
y_pred = cars_gbm.predict(cars)
# convert predictions (original uncalibrated probabilities) into integer labels
y_pred = (y_pred['p1'] >= maxf1_threshold).ifelse(1,0)
y_pred = y_pred.as_data_frame()
y_pred.columns = ['p1']
y_true1 = y_true.economy_20mpg
y_pred1 = y_pred.p1
class_names = np.array(cars["economy_20mpg"].levels()[0])
# Plot non-normalized confusion matrix
plot_confusion_matrix(y_true1, y_pred1, classes=class_names,
title='Confusion matrix')
image result:
Please note that there is a bug in the H2O-3 confusion matrix that has been noted here