How do you retrain an AutoML model with a "new" set of data? - h2o

I have a training set of 1000 rows (one row per day, for example).
I get predictions for the next 5 days (model.predict).
Over the next 5 days, I get the actual data for those days (numbers such as sales, for example).
Now I want the model to be trained on those 5 actual real-life data points only, instead of on 1005 rows (i.e. the 1000 original rows plus the 5 new ones).
Can this be done? Sorry for the "basic" question; all help (including links, if this has already been answered) is appreciated.
Code
import h2o
from h2o.automl import H2OAutoML
import pandas as pd
h2o.init()
data_path = "./df.csv"
df = h2o.import_file(data_path)
y = "c"
splits = df.split_frame(ratios = [0.8,0.19], seed = 1)
train = splits[0] #some part to train first
test = splits[1] # this is test set 1 (test later to become train set)
test2 = splits[2] # assume this to be the real world values
aml = H2OAutoML(max_runtime_secs=120,project_name='try', seed=1234)
aml.train(y = y, training_frame = train)
#First set of predictions
yy=aml.predict(test)
x=yy.as_data_frame(use_pandas=True) # predictions based on train set
#print them
print(x)
#the test set is now "new real world data"
#to be added as incremental training of the model
aml.train(y = y, training_frame = test)
#get the predictions again
yy=aml.predict(test2)
x=yy.as_data_frame(use_pandas=True)
print(x)
I have tried to retrain on the "new set" of data (assuming that is what the second aml.train call does) but get rather curious numbers.

Related

Error: requires numeric/complex matrix/vector arguments for %*%; cross validating glmmTMB model

I am adapting some k-fold cross-validation code written for glmer/merMod models to a glmmTMB model framework. All seems well until I try to use the output from the model(s) fit with training data to predict and exponentiate values into a matrix (to then break into quantiles/number of bins to assess predictive performance). I can get this line to work using glmer models, but when I run the same model using glmmTMB I get "Error in model.matrix: requires numeric/complex matrix/vector arguments". There are many other posts discussing this error and I have tried converting the data frame into matrix form and changing the class of the covariates, with no luck. Separately running the parts before and after the %*% works, but when combined I get the error. For context, this code is intended to be run with use/availability data, so the example variables may not make sense, but they show the problem well enough. Any suggestions as to what is going on?
library(lme4)
library(glmmTMB)
# Example with mtcars dataset
data(mtcars)
# Model both with glmmTMB and lme4
m1 <- glmmTMB(am ~ mpg + wt + (1|carb), family = poisson, data=mtcars)
m2 <- glmer(am ~ mpg + wt + (1|carb), family = poisson, data=mtcars)
#--- K-fold code (hashed out sections are original glmer version of code where different)---
# define variables
k <- 5
mod <- m1 #m2
dt <- model.frame(mod) #data used
reg.list <- list() # initialize object to store all models used for cross validation
# finds the name of the response variable in the model dataframe
resp <- as.character(attr(terms(mod), "variables"))[attr(terms(mod), "response") + 1]
# define column called sets and populates it with character "train"
dt$sets <- "train"
# randomly selects a proportion of the "used"/am records (i.e. am = 1) for testing data
dt$sets[sample(which(dt[, resp] == 1), sum(dt[, resp] == 1)/k)] <- "test"
# updates the original model using only the subset of "trained" data
reg <- glmmTMB(formula(mod), data = subset(dt, sets == "train"), family=poisson,
control = glmmTMBControl(optimizer = optim, optArgs=list(method="BFGS")))
#reg <- glmer(formula(mod), data = subset(dt, sets == "train"), family=poisson,
# control = glmerControl(optimizer = "bobyqa", optCtrl=list(maxfun=2e5)))
reg.list[[i]] <- reg # store models
# uses new model created with training data (i.e. reg) to predict and exponentiate values
predall <- exp(as.numeric(model.matrix(terms(reg), dt) %*% glmmTMB::fixef(reg)))
#predall <- exp(as.numeric(model.matrix(terms(reg), dt) %*% lme4::fixef(reg)))
Without looking at the code too carefully: glmmTMB::fixef(reg) returns a list (with elements cond (conditional model parameters), zi (zero-inflation parameters), and disp (dispersion parameters)) rather than a vector.
If you replace this bit with glmmTMB::fixef(reg)[["cond"]], it will probably work.

numpy delete isn't deleting full array of objs

I'm trying to split a dataset into train and test groups in Python using a method similar to what I'm used to in R (I realize there are other options). So I'm defining an array of row numbers that will make up my train set. I then want to grab the remaining row numbers for my test set using np.delete. Since there are 170 rows total and 136 go to the train set, the test set should have 34 rows. But it's got 80 -- the actual number varies when I change my random seed ... What have I got wrong here?
import numpy as np
np.random.seed(222)
marriage = np.random.rand(170,55)
rows,cols = marriage.shape
sample = np.random.randint(0,rows-1,(round(.8*rows)))
train = marriage[sample,:]
test = np.delete(marriage, sample, axis=0)
print(marriage.shape)
print(len(sample))
print(train.shape)
print(test.shape)
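One likely explanation for the mismatch described above: np.random.randint draws with replacement, so the index array contains repeated row numbers, and np.delete removes each distinct row only once. Its upper bound is also exclusive, so rows-1 can never select the last row. The sketch below is a hedged illustration rather than a definitive fix; it assumes NumPy's Generator API (np.random.default_rng) and uses the name with_dupes, which is not in the original, to show the repeated indices.

```python
import numpy as np

rng = np.random.default_rng(222)
marriage = rng.random((170, 55))
rows, cols = marriage.shape
n_train = round(0.8 * rows)

# Sampling WITH replacement (what randint does): indices repeat,
# so fewer than n_train distinct rows are actually selected.
with_dupes = rng.integers(0, rows, n_train)
print("distinct rows when sampling with replacement:", len(np.unique(with_dupes)))

# Sampling WITHOUT replacement: every index appears at most once.
sample = rng.choice(rows, size=n_train, replace=False)
train = marriage[sample, :]
test = np.delete(marriage, sample, axis=0)

print(train.shape)
print(test.shape)
```

With unique indices, the train and test row counts add up to the total number of rows.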

Use iris efficiently to combine iris.analysis and aggregated_by while outputting the time of occurrence

I am using python/iris to get annual extreme values from daily data. I use aggregated_by('season_year', iris.analysis.MIN) to get the extreme values, but I also need to know when in each year they occur. I have written the code below, but it is really slow, so I am wondering whether anyone knows a built-in iris way to do this, or can otherwise think of a more efficient approach?
Thank you!
import iris
import numpy as np
#--- get daily data
cma = iris.load_cube('daily_data.nc')
#--- get annual extremes
c_metric = cma.aggregated_by('season_year', iris.analysis.MIN)
#--- add date of when the extremes are occurring
extrdateli=[]
#loop over all years
for mij in range(c_metric.data.shape[0]):
    #
    # get extreme value
    m = c_metric.data[mij]
    #
    #get values for this year
    cma_thisseasyr = cma.extract(iris.Constraint(season_year=lambda season_year:season_year==c_metric.coord('season_year').points[mij]))
    #
    #get date in data cube for when this extreme occurs and add it as a string to a list
    extrdateli += [ str(c_metric.coord('season_year').points[mij])+':'+','.join([''.join(_) for _ in zip([str(_) for _ in cma_thisseasyr.coord('day').points[np.where(cma_thisseasyr.data==m)]], [str(_) for _ in cma_thisseasyr.coord('month').points[np.where(cma_thisseasyr.data==m)]], [str(_) for _ in cma_thisseasyr.coord('year').points[np.where(cma_thisseasyr.data==m)]])])]
#add this list to the metric cube as attribute
c_metric.attributes['date_of_extreme_value'] = ' '.join(extrdateli)
#--- save to file
iris.save(c_metric, 'annual_min.nc')
I think the slow part is where you extract the values for each season year. You can speed this up a bit by dispensing with the lambda, i.e.:
iris.Constraint(season_year=c_metric.coord('season_year').points[mij])
If this is still too slow, you could work directly on the numpy arrays in your cube. Slicing numpy arrays is much faster than extracting from cubes. For simplicity, the example below assumes you have a time coordinate.
import iris
import numpy as np
import iris.coord_categorisation as cat
#--- create a dummy data cube
ndays = 12 * 365 + 3 # 12 years of data
tcoord = iris.coords.DimCoord(range(ndays), units='days since 2001-02-01',
standard_name='time')
cma = iris.cube.Cube(np.random.normal(0, 1, ndays), long_name='blah')
cma.add_dim_coord(tcoord, 0)
cat.add_season_year(cma, 'time')
#--- get annual extremes
c_metric = cma.aggregated_by('season_year', iris.analysis.MIN)
#--- add date of when the extremes are occurring
extrdateli=[]
#loop over all years
for mij in range(c_metric.data.shape[0]):
    #
    #get extreme value
    m = c_metric.data[mij]
    #
    #get values for this year
    year_index = cma.coord('season_year').points == c_metric.coord('season_year').points[mij]
    temperatures_this_syear = cma.data[year_index]
    dates_this_syear = tcoord.units.num2date(tcoord.points[year_index])
    #
    #get date in data cube for when this extreme occurs and add it as a string to a list
    extreme_dates = dates_this_syear[temperatures_this_syear==m]
    extrdateli += [ str(c_metric.coord('season_year').points[mij])+':'+','.join(str(date) for date in extreme_dates)]
#add this list to the metric cube as attribute
c_metric.attributes['date_of_extreme_value'] = ' '.join(extrdateli)
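The idea of working on plain numpy arrays can be taken one step further by using np.argmin to find the position of each year's minimum directly. The following is only a sketch under stated assumptions: it uses synthetic data, it groups by calendar year as a stand-in for iris's season_year, and the names values, dates, years and extreme_dates are illustrative rather than taken from the code above. Unlike the equality test used above, np.argmin reports only the first occurrence if the minimum is tied.

```python
import numpy as np

# Synthetic daily series covering three non-leap calendar years (2001-2003).
rng = np.random.default_rng(0)
ndays = 3 * 365
values = rng.normal(0.0, 1.0, ndays)
dates = np.datetime64("2001-01-01") + np.arange(ndays)

# Calendar year of each day (assumption: stands in for season_year).
years = dates.astype("datetime64[Y]").astype(int) + 1970

extreme_dates = {}
for year in np.unique(years):
    in_year = years == year                 # boolean mask for this year
    idx = np.argmin(values[in_year])        # position of the minimum within the year
    extreme_dates[int(year)] = (str(dates[in_year][idx]),
                                float(values[in_year][idx]))

for year, (date, value) in extreme_dates.items():
    print(year, date, value)
```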

Supervised Machine Learning, producing a trained estimator

I have an assignment in which I am supposed to use scikit, numpy and pylab to do the following:
"All of the following should use data from the training_data.csv file
provided. training_data gives you a labeled set of integer pairs,
representing the scores of two sports teams, with the labels giving the
sport.
Write the following functions:
plot_scores() should draw a scatterplot of the data.
predict(dataset) should produce a trained Estimator to guess the sport
that resulted in a given score (from a dataset we've withheld, which will
be inputs as a 1000 x 2 np array). You can use any algorithm from scikit.
An optional additional function called "preprocess" will process dataset
before we it is passed to predict.
"
This is what I have done so far:
import numpy as np
import scipy as sp
import pylab as pl
from random import shuffle
def plot_scores():
    k=open('training_data.csv')
    lst=[]
    for triple in k:
        temp=triple.split(',')
        lst.append([int(temp[0]), int(temp[1]), int(temp[2][:1])])
    array=np.array(lst)
    pl.scatter(array[:,0], array[:,1])
    pl.show()
def preprocess(dataset):
    k=open('training_data.csv')
    lst=[]
    for triple in k:
        temp=triple.split(',')
        lst.append([int(temp[0]), int(temp[1]), int(temp[2][:1])])
    shuffle(lst)
    return lst
In preprocess, I shuffled the data because I am supposed to use some of it to train on and some of it to test on, but the original data was not at all random. My question is, how am I supposed to "produce a trained estimator" in predict(dataset)? Is this supposed to be a function that returns another function? And which algorithm would be ideal to classify based on a dataset that looks like this:
The task likely wants you to train a standard scikit classifier model and return it, i.e. something like
from sklearn.svm import SVC
def predict(dataset):
    X = ... # features, extract from dataset
    y = ... # labels, extract from dataset
    clf = SVC() # create classifier
    clf.fit(X, y) # train
    return clf
Though judging from the name of the function (predict) you should check if it really wants you to return a trained classifier or return predictions for the given dataset argument, as that would be more typical.
As a classifier you can basically use any one you like. Your plot suggests that your dataset is linearly separable (there are no colors for the classes, but I assume that the blobs are the two classes). On linearly separable data hardly anything will fail. Try SVMs, logistic regression, random forests, naive Bayes, ... For extra fun you can try to plot the decision boundaries, see here (which also contains an overview of the available classifiers).
I would recommend you to take a look at this structure:
from random import shuffle
import matplotlib.pyplot as plt
# import a classifier you need
def get_data():
    # open your file and parse data to prepare X as a set of input vectors and Y as a set of targets
    return X, Y
def split_data(X, Y):
    size = len(X)
    indices = list(range(size))  # shuffle needs a mutable list
    shuffle(indices)
    train_indices = indices[:size // 2]  # integer division for slicing
    test_indices = indices[size // 2:]
    X_train = [X[i] for i in train_indices]
    Y_train = [Y[i] for i in train_indices]
    X_test = [X[i] for i in test_indices]
    Y_test = [Y[i] for i in test_indices]
    return X_train, Y_train, X_test, Y_test
def plot_scatter(Y1, Y2):
    plt.figure()
    plt.scatter(Y1, Y2, c='b', marker='o')
    plt.show()
# get data
X, Y = get_data()
# split data
X_train, Y_train, X_test, Y_test = split_data(X, Y)
# create a classifier as an object
classifier = YourImportedClassifier()
# train the classifier, after that the classifier is the trained estimator you need
classifier.fit(X_train, Y_train) # scikit-learn estimators are trained with .fit
# make a prediction
Y_prediction = classifier.predict(X_test)
# plot the scatter
plot_scatter(Y_prediction, Y_test)
I think what you are looking for is the clf.fit() function, rather than writing a function that returns another function.
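To make "a trained estimator" concrete: it is simply an object that has learned something in fit and can then answer predict calls. The sketch below shows that shape in plain Python, without scikit-learn. The class TinyCentroidClassifier and the score pairs are invented for illustration; the nearest-centroid rule is a stand-in technique, not what the assignment prescribes.

```python
from collections import defaultdict


class TinyCentroidClassifier:
    """Hypothetical toy estimator: predicts the label with the nearest mean point."""

    def fit(self, X, y):
        sums = defaultdict(lambda: [0.0, 0.0])
        counts = defaultdict(int)
        for (a, b), label in zip(X, y):
            sums[label][0] += a
            sums[label][1] += b
            counts[label] += 1
        # Learned state: one centroid (mean score pair) per label.
        self.centroids_ = {
            label: (s[0] / counts[label], s[1] / counts[label])
            for label, s in sums.items()
        }
        return self  # returning self mirrors the scikit-learn convention

    def predict(self, X):
        def nearest(point):
            return min(
                self.centroids_,
                key=lambda lab: (point[0] - self.centroids_[lab][0]) ** 2
                + (point[1] - self.centroids_[lab][1]) ** 2,
            )
        return [nearest(p) for p in X]


# Made-up score pairs: label 0 = low-scoring sport, label 1 = high-scoring sport.
X_train = [(1, 0), (2, 1), (3, 2), (98, 101), (110, 95), (105, 99)]
y_train = [0, 0, 0, 1, 1, 1]

clf = TinyCentroidClassifier().fit(X_train, y_train)  # the "trained estimator"
predictions = clf.predict([(2, 2), (100, 100)])
print(predictions)
```

A function that returns clf after calling fit is therefore returning an object, not another function.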

Extrapolating variance components from Weir-Fst on Vcftools

vcftools --vcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf --weir-fst-pop POP1.txt --weir-fst-pop POP2.txt --out fst.POP1.POP2
The above command computes Fst distances on 1000 Genomes population data using Weir and Cockerham's 1984 formula. This formula uses 3 variance components, namely a, b, c (between populations; between individuals within populations; between gametes within individuals within populations).
The output directly provides the result of the formula but not the components that the program calculated to arrive at the final result. How can I ask Vcftools to output the values for a,b,c?
If you can get the data into the format for hierfstat, you can get the variance components from varcomp.glob. What I normally do is:
use vcftools with --012 to get genotypes
convert 0/1/2/-1 to hierfstat format (e.g., 11/12/22/NA)
load the data into hierfstat and compute (see below)
R example:
library(hierfstat)
data = read.table("hierfstat.txt", header=T, sep="\t")
levels = data.frame(data$popid)
loci = data[,2:ncol(data)]
res = varcomp.glob(levels=levels, loci=loci, diploid=T)
print(res$loc)
print(res$F)
Without a hierarchical design, Fst for each locus (row) is therefore, from res$loc: res$loc[1]/sum(res$loc). If you have more complicated sampling, you'll need to interpret the variance components differently.
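As a hedged illustration of how the three components combine: Weir and Cockerham's per-locus estimator is a / (a + b + c), and a multi-locus estimate is commonly formed as a ratio of sums, sum(a) / sum(a + b + c), rather than as a mean of per-locus ratios. The sketch below uses made-up component values and a hypothetical helper name (fst_per_locus); it is not output from vcftools or hierfstat.

```python
def fst_per_locus(a, b, c):
    """Per-locus Fst from variance components: a / (a + b + c)."""
    total = a + b + c
    return a / total if total != 0 else float("nan")


# Made-up (a, b, c) variance components for three loci.
components = [
    (0.02, 0.01, 0.17),
    (0.00, 0.02, 0.18),
    (0.05, 0.00, 0.20),
]

per_locus = [fst_per_locus(a, b, c) for a, b, c in components]

# Multi-locus estimate as a ratio of sums over loci.
sum_a = sum(a for a, _, _ in components)
sum_total = sum(a + b + c for a, b, c in components)
ratio_of_sums = sum_a / sum_total

print(per_locus)
print(ratio_of_sums)
```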
--update per your comment--
I do this in Pandas, but any language would do. It's a text replacement exercise. Just get your .012 file into a dataframe and convert as below. I read it row by row into numpy because I have tons of SNPs, but read_csv would work, too.
import pandas as pd
import numpy as np
z12_file = "out.012"  # assumption: path to the genotype matrix written by vcftools --012
z12_data = []
for i, line in enumerate(open(z12_file)):
    line = line.strip()
    line = [int(x) for x in line.split("\t")]
    z12_data.append(np.array(line))
    if i % 10 == 0:
        print(i)  # progress indicator
z12_data = np.array(z12_data)
z12_df = pd.DataFrame(z12_data)
z12_df = z12_df.drop(0, axis=1)  # drop the first column
z12_df.columns = pd.Series(z12_df.columns)-1  # renumber remaining columns from 0
hierf_trans = {0:11, 1:12, 2:22, -1:'NA'}
def apply_hierf_trans(series):
    return [hierf_trans[x] if x in hierf_trans else x for x in series]
hierf = z12_df.apply(apply_hierf_trans)
hierf.to_csv("hierfstat.txt", header=True, index=False, sep="\t")
Then, you'd read that file hierfstat.txt into R, these are your loci. You'd need to specify your levels in your sampling design (e.g., your population). Then call varcomp.glob() to get the variance components. I have a parallel version of this here if you want to use it.
Note that you are specifying 0 as the reference allele, in this case. May be what you want, maybe not. I often calculate minor allele frequency and make 2 the minor allele, but it depends on your study goal.

Resources