PYOMO LP, combination set - set

I would like your help with a Pyomo LP. I am not sure what I am doing wrong, so any feedback would be helpful.
My data set looks like this:
#The demand for each order
demand= {782912: 808, 782913: 3188, 782914: 2331, 782915: 847, 782916: 2163,789954:5643}
#The cost per unit produced for each order based in which factory chosen
total_costs_per_unit = {(782912, 'PLANT16'): 0.46, (782913, 'PLANT16'): 0.46, (782914, 'PLANT16'): 0.46, (782915, 'PLANT16'): 0.46, (782916, 'PLANT16'): 0.46, (789954, 'PLANT05'): 0.90, (789954, 'PLANT07'): 0.91, (789954, 'PLANT08'): 1.13, (789954, 'PLANT10'): 0.12}
#The capacity of each factory
supply= {'PLANT05': 531,'PLANT07': 841,'PLANT08': 1107,'PLANT10': 981,'PLANT16': 2313}
#defining the model
model.j=pyo.Set(initialize=supply.keys()) #factories
model.select_combos = pyo.Set(within = model.i * model.j, initialize = total_costs_per_unit.keys())
model.p=pyo.Param(model.i,model.j,initialize=total_costs_per_unit) # here goes the cost dictionary
#Decision variable
#Objective function
def obj_rul(model):
    return sum(p[i,j]*x[i,j] for i,j in model.select_combos)
#warning , not all combinations of i,j exist in my model.p, as they would not be valid solutions for the problem
def Const1(model,i):
    return sum(x[i,j] for j in model.j)>=d[i]
def Const5(model,j):
    return sum(x[i,j] for i in model.i)<=s[j]
print('Obj funct=',model.Obj())
for i in model.i:
    for j in model.j:
        print('for order',i," from plant ",j, " is sent ",x[i,j]())
The error message I am getting is :
ERROR:pyomo.core:evaluating object as numeric value: x[782912,PLANT16]
(object: <class 'pyomo.core.base.var._GeneralVarData'>)
No value for uninitialized NumericValue object x[782912,PLANT16]
All in all, I think that if I had all combinations of orders and plants, there would not have been an issue. As it is now, I don't know how to solve this problem.

You are on the right track here and your suspicion is correct.
You are getting the error because some combinations of i, j are never used in your model, so the corresponding variables are never given a value, and printing them fails. Other parts of your model have the same underlying issue.
You have several options here.
Your model has an implied constraint: if there is no production cost for an order at a factory, the order cannot be produced there. You could add that to your data by putting a production cap on each order-factory combination, which is not much fun.
Or you could put in a very large cost for the production of the "illegal" items so that the solver doesn't pick them.
But you have already done part of the work for what I think is the best solution: making a set of "legal combinations" of i, j. You just didn't bring it to the finish line. You can and should use this subset as the index for the things that naturally comply with it, such as x and p. See my code. You just need to make sure that the sums only use legal i, j combinations, which is easy to do by screening them as I show in your updated constraints.
Note: I multiplied all of your capacities by 10 because the initial model was not feasible. With that change it is feasible and runs, and you should be able to modify it easily as needed.
import pyomo.environ as pyo
#The demand for each order
demand= {782912: 808, 782913: 3188, 782914: 2331, 782915: 847, 782916: 2163,789954:5643}
#The cost per unit produced for each order based in which factory chosen
production_cost= { (782912, 'PLANT16'): 0.46,
                   (782913, 'PLANT16'): 0.46,
                   (782914, 'PLANT16'): 0.46,
                   (782915, 'PLANT16'): 0.46,
                   (782916, 'PLANT16'): 0.46,
                   (789954, 'PLANT05'): 0.90,
                   (789954, 'PLANT07'): 0.91,
                   (789954, 'PLANT08'): 1.13,
                   (789954, 'PLANT10'): 0.12}
#The capacity of each factory
supply= {'PLANT05': 5310,'PLANT07': 8410,'PLANT08': 11070,'PLANT10': 9810,'PLANT16': 23130}
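model = pyo.ConcreteModel()
model.i=pyo.Set(initialize=demand.keys()) #orders
model.j=pyo.Set(initialize=supply.keys()) #factories
model.select_combos = pyo.Set(within = model.i * model.j, initialize = production_cost.keys())
model.p=pyo.Param(model.select_combos, initialize=production_cost) # here goes the cost dictionary
model.d=pyo.Param(model.i, initialize=demand) # demand per order
model.s=pyo.Param(model.j, initialize=supply) # capacity per factory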
# p=model.p <-- I wouldn't rename things...confusing
# d=model.d
# s=model.s
#Decision variable
model.x=pyo.Var(model.select_combos, within=pyo.NonNegativeReals)
# x=model.x
#Objective function
# def obj_rul(model):
# return sum(p[i,j]*x[i,j] for i,j in model.select_combos)
#warning , not all combinations of i,j exist in my model.p, as they would not be valid solutions for the problem
# your objective is a simple sum() expression, so you don't need the complication of a rule-function combo...
model.Obj=pyo.Objective(expr=sum(model.p[i,j]*model.x[i,j] for i,j in model.select_combos),sense=pyo.minimize)
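def Const1(model,i):
    return sum(model.x[i,j] for j in model.j if (i,j) in model.select_combos)>=model.d[i]
model.Const1=pyo.Constraint(model.i, rule=Const1)
def Const5(model,j):
    return sum(model.x[i,j] for i in model.i if (i,j) in model.select_combos)<=model.s[j]
model.Const5=pyo.Constraint(model.j, rule=Const5)
# solve before reading values; 'glpk' is an assumption, use whichever LP solver you have installed
results = pyo.SolverFactory('glpk').solve(model)
#print('Obj funct=',model.Obj())
for i, j in model.select_combos:
    print('for order',i," from plant ",j, " is sent ", pyo.value(model.x[i,j]))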
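The infeasibility of the original capacities mentioned in the note can be checked by hand before any solver is involved. The following is a minimal sketch in plain Python, using the data from the question; the helper name `shortfalls` and its `scale` parameter are made up for this illustration. It only tests a necessary condition (each order's demand against the total capacity of the plants it is allowed to use), so an empty result does not prove feasibility.

```python
# Data copied from the question (original, unscaled capacities).
demand = {782912: 808, 782913: 3188, 782914: 2331, 782915: 847,
          782916: 2163, 789954: 5643}
legal_combos = [(782912, 'PLANT16'), (782913, 'PLANT16'), (782914, 'PLANT16'),
                (782915, 'PLANT16'), (782916, 'PLANT16'), (789954, 'PLANT05'),
                (789954, 'PLANT07'), (789954, 'PLANT08'), (789954, 'PLANT10')]
supply = {'PLANT05': 531, 'PLANT07': 841, 'PLANT08': 1107,
          'PLANT10': 981, 'PLANT16': 2313}


def shortfalls(demand, legal_combos, supply, scale=1):
    """Return {order: (demand, reachable capacity)} for orders that cannot be met.

    Hypothetical helper: compares each order's demand with the total
    (optionally scaled) capacity of the plants it may legally use.
    """
    plants_of = {}
    for order, plant in legal_combos:
        plants_of.setdefault(order, []).append(plant)
    short = {}
    for order, qty in demand.items():
        reachable = scale * sum(supply[p] for p in plants_of.get(order, []))
        if reachable < qty:
            short[order] = (qty, reachable)
    return short


original = shortfalls(demand, legal_combos, supply)           # capacities as posted
scaled = shortfalls(demand, legal_combos, supply, scale=10)   # capacities times 10
print(original)
print(scaled)
```

With the posted capacities, three orders already exceed what their legal plants can deliver on their own, which is why the LP has no feasible solution until the capacities are raised.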


How to update training dataset at epoch begin in Huggingface Trainer using Callback?

I want to recreate the training dataset with a function generate_custom_train_set at the beginning of every epoch. Is there a way to do this with Trainer using a callback?
My trainer looks like
trainer = Trainer(
I have the same question, as I am trying to implement Examples-proportional mixing from the T5 paper. I didn't find built-in support for this in Hugging Face.
My current solution is to modify trainer.train_dataset in the on_epoch_begin callback.
Here's an implementation. I'm using this in my own project, and it seems to work.
First, implement your per-epoch change in your Dataset. In my case, it's the sample function for Examples-Proportional Mixing.
from typing import List, Optional, Tuple

import torch
from torch.utils.data import Dataset  # assumption: PyTorch-style datasets


class ProportionMixingDataset:
    """
    Examples-proportional mixing from T5
    TODO: failed to find a pytorch working implementation

    Equivalent to, for the larger datasets, a new subset is taken at each epoch,
        then sample in the joined subset once
    """
    def __init__(self, dataset_list: List[Dataset] = None, k: int = None):
        """
        :param dataset_list: Ordered list of datasets
        :param k: Artificial limit
        """
        self.dsets = dataset_list
        assert k is not None
        self.k = k
        self.dset_szs = [min(len(d), k) for d in self.dsets]
        self.sz = sum(self.dset_szs)  # total length; the attribute name `sz` is an assumption
        self._sampled_idxs: List[Optional[torch.Tensor]] = [None] * len(self.dsets)

    def sample(self):
        """
        Sub-sample datasets larger than k

        Intended to call in each epoch
        """
        for i, dset in enumerate(self.dsets):
            sz = len(dset)
            if sz > self.k:
                self._sampled_idxs[i] = torch.randperm(sz)[:self.k]

    def __len__(self):
        return self.sz

    def _idx2dset_idx(self, idx: int) -> Tuple[int, int]:
        """
        Convert a global index to a dataset index
        """
        for i, sz in enumerate(self.dset_szs):
            if idx < sz:
                return i, idx
            idx -= sz
        raise ValueError('Should not happen')

    def __getitem__(self, idx):
        if not isinstance(idx, int):
            raise ValueError('Batched indexing not supported')
        idx_dset, idx = self._idx2dset_idx(idx)
        dset = self.dsets[idx_dset]
        if self._sampled_idxs[idx_dset] is not None:  # A sub-sample index
            idx = self._sampled_idxs[idx_dset][idx].item()
        return dset[idx]
Then pass that dataset to Trainer.
Now comes the magic part:
class ProportionalMixCallback(TrainerCallback):
    """
    Trigger re-computing subset for dataset Examples-proportional mixing, see `dataset::ProportionMixingDataset`

    A hack that modifies the train dataset, pointed by Trainer's dataloader
    """
    def __init__(self, trainer: Trainer):
        self.trainer = trainer

    def on_epoch_begin(self, args: TrainingArguments, state, control, **kwargs):
        # re-draw the per-epoch subset on the dataset the trainer already holds
        self.trainer.train_dataset.sample()
Pass this to your trainer as a callback.
This triggers the sample call which modifies the dataset at the times we need it.
This works because the train DataLoader in the trainer still points to the same train dataset object.
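The reason the hack works can be shown without any Hugging Face or PyTorch code. The sketch below is a plain-Python stand-in, not the Trainer's actual machinery: `ResampledDataset` and `Loader` are hypothetical names invented for this illustration. The loader is created once and keeps a reference to the dataset, so a `sample()` call between epochs changes what the next pass over the loader yields.

```python
import random


class ResampledDataset:
    """Wraps a sequence and exposes at most k items, re-drawn by sample()."""

    def __init__(self, items, k, seed=0):
        self.items = list(items)
        self.k = min(k, len(self.items))
        self._rng = random.Random(seed)
        self._idxs = list(range(self.k))

    def sample(self):
        # Draw a fresh subset of indices; intended to run once per epoch.
        self._idxs = self._rng.sample(range(len(self.items)), self.k)

    def __len__(self):
        return len(self._idxs)

    def __getitem__(self, idx):
        return self.items[self._idxs[idx]]


class Loader:
    """Stand-in for a data loader: it only keeps a reference to the dataset."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        return (self.dataset[i] for i in range(len(self.dataset)))


dataset = ResampledDataset(range(100), k=5)
loader = Loader(dataset)           # created once, like the trainer's loader
epochs = []
for _ in range(3):
    dataset.sample()               # what an on_epoch_begin hook would trigger
    epochs.append(list(loader))    # the same loader now sees the new subset
print(epochs)
```

If the loader copied the data instead of holding a reference, the per-epoch resampling would have no effect, so this approach depends on that implementation detail.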

TensorFlow - directly calling tf.function much faster than calling tf.function returned from wrapper

I am training a VAE (using federated learning, but that is not so important) and wanted to keep the loss and train functions simple to exchange. The initial approach was to have a tf.function as loss function and a tf.function as train function as follows:
def kl_reconstruction_loss(model, model_input, beta):
    x, y = model_input
    mean, logvar = model.encode(x, y)
    z = model.reparameterize(mean, logvar)
    x_logit = model.decode(z, y)
    cross_ent = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=x)
    reconstruction_loss = tf.reduce_mean(tf.reduce_sum(cross_ent, axis=[1, 2, 3]), axis=0)
    kl_loss = tf.reduce_mean(0.5 * tf.reduce_sum(tf.exp(logvar) + tf.square(mean) - 1. - logvar, axis=-1), axis=0)
    loss = reconstruction_loss + beta * kl_loss
    return loss, kl_loss, reconstruction_loss

def train_fn(model: tf.keras.Model, batch, optimizer, kl_beta):
    """Trains the model on a single batch.

    Args:
        model: The VAE model.
        batch: A batch of inputs [images, labels] for the vae.
        optimizer: The optimizer to train the model.
        kl_beta: Weighting of KL loss

    Returns:
        The loss.
    """
    def vae_loss():
        """Does the forward pass and computes losses for the generator."""
        # N.B. The complete pass must be inside loss() for gradient tracing.
        return kl_reconstruction_loss(model, batch, kl_beta)

    with tf.GradientTape() as tape:
        loss, kl_loss, rc_loss = vae_loss()
    grads = tape.gradient(loss, model.trainable_variables)
    grads_and_vars = zip(grads, model.trainable_variables)
    optimizer.apply_gradients(grads_and_vars)
    return loss
For my dataset this results in an epoch duration of approx. 25 seconds. However, since I have to call those functions directly in my code, I would have to swap in different ones whenever I wanted to try out different loss/train functions.
So, alternatively, I wrapped the loss function in a class and the train function in another function. Now I have:
class VaeKlReconstructionLossFns(AbstractVaeLossFns):
    def vae_loss(self, model, model_input, labels, global_round):
        # KL Reconstruction loss
        mean, logvar = model.encode(model_input, labels)
        z = model.reparameterize(mean, logvar)
        x_logit = model.decode(z, labels)
        cross_ent = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=model_input)
        reconstruction_loss = tf.reduce_mean(tf.reduce_sum(cross_ent, axis=[1, 2, 3]), axis=0)
        kl_loss = tf.reduce_mean(0.5 * tf.reduce_sum(tf.exp(logvar) + tf.square(mean) - 1. - logvar, axis=-1), axis=0)
        loss = reconstruction_loss + self._get_beta(global_round) * kl_loss
        if model.losses:
            loss += tf.add_n(model.losses)
        return loss, kl_loss, reconstruction_loss
def create_train_vae_fn(
        vae_loss_fns: vae_losses.AbstractVaeLossFns,
        vae_optimizer: tf.keras.optimizers.Optimizer):
    """Create a function that trains VAE, binding loss and optimizer.

    Args:
        vae_loss_fns: Instance of vae_losses.AbstractVaeLossFns interface,
            specifying the VAE training loss.
        vae_optimizer: Optimizer for training the VAE.

    Returns:
        Function that executes one step of VAE training.
    """
    # We check that the optimizer has not been used previously, which ensures
    # that when it is bound the train fn isn't holding onto a different copy of
    # the optimizer variables than the copy that is being exchanged b/w server and
    # clients.
    if vae_optimizer.variables():
        raise ValueError(
            'Expected vae_optimizer to not have been used previously, but '
            'variables were already initialized.')

    # the parameters after `model` are taken from the docstring below;
    # the default for new_optimizer_state is an assumption
    def train_vae_fn(model: tf.keras.Model,
                     model_inputs,
                     labels,
                     global_round,
                     new_optimizer_state=None):
        """Trains the model on a single batch.

        Args:
            model: The VAE model.
            model_inputs: A batch of inputs (usually images) for the VAE.
            labels: A batch of labels corresponding to the inputs.
            global_round: The current global FL round for beta calculation
            new_optimizer_state: A possible optimizer state to overwrite the current one with.

        Returns:
            The number of examples trained on.
            The loss.
            The updated optimizer state.
        """
        def vae_loss():
            """Does the forward pass and computes losses for the generator."""
            # N.B. The complete pass must be inside loss() for gradient tracing.
            return vae_loss_fns.vae_loss(model, model_inputs, labels, global_round)

        # Set optimizer vars
        optimizer_state = get_optimizer_state(vae_optimizer)
        if new_optimizer_state is not None:
            # if optimizer is uninitialised, initialise vars
            try:
                tf.nest.assert_same_structure(optimizer_state, new_optimizer_state)
            except ValueError:
                initialize_optimizer_vars(vae_optimizer, model)
                optimizer_state = get_optimizer_state(vae_optimizer)
                tf.nest.assert_same_structure(optimizer_state, new_optimizer_state)
            tf.nest.map_structure(lambda a, b: a.assign(b), optimizer_state, new_optimizer_state)

        with tf.GradientTape() as tape:
            loss, kl_loss, rc_loss = vae_loss()
        grads = tape.gradient(loss, model.trainable_variables)
        grads_and_vars = zip(grads, model.trainable_variables)
        vae_optimizer.apply_gradients(grads_and_vars)
        return tf.shape(model_inputs)[0], loss, optimizer_state

    return train_vae_fn
This new formulation takes about 86 seconds per epoch.
I am struggling to understand why the second version performs so much worse than the first one. Does anyone have a good explanation for this?
Thanks in advance!
EDIT: My Tensorflow version is 2.5.0

How to add a maximum travel time duration for the sum of all routes in VRP Google OR-TOOLS

I am new to programming and used Google OR-Tools to create my VRP model. In my current model, I have included a general time window and a capacity constraint per vehicle, which makes it a capacitated vehicle routing problem with time windows. I followed the OR-Tools guides, which include a maximum travel duration for each vehicle.
However, I want to include a maximum travel duration for the sum of all routes, whereas the maximum travel duration for each vehicle does not matter (so I set it to 100000). Accordingly, I want to add something to the model or the solution printer that tells me how many addresses could not be visited because of the constraint on the maximum travel duration for the sum of all routes. From the examples I have seen, I think it should be fairly easy, but my programming knowledge is limited, so my attempts have had no success. Can anyone help me?
import pandas as pd
import openpyxl
import numpy as np
import math
from random import sample
from ortools.constraint_solver import routing_enums_pb2
from ortools.constraint_solver import pywrapcp
from scipy.spatial.distance import squareform, pdist
from haversine import haversine
#STEP - create data
# import/read excel file
data = pd.read_excel(r'C:\Users\Jean-Paul\Documents\Thesis\OR TOOLS\Data.xlsx', engine = 'openpyxl')
df = pd.DataFrame(data, columns= ['number','lat','lng']) # create dataframe with 10805 addresses + address of the depot
#print (df)
# randomly sample X addresses from the dataframe and their corresponding number/latitude/longtitude
df_sample = df.sample(n=100)
#print (df_data)
# read first row of the excel file (= coordinates of the depot)
df_depot = pd.DataFrame(data, columns= ['number','lat','lng']).iloc[0:1]
#print (df_depot)
# combine dataframe of depot and sample into one dataframe
df_data = pd.concat([df_depot, df_sample], ignore_index=True, sort=False)
#print (df_data)
#STEP - create distance matrix data
# determine distance between latitude and longtitude
df_data.set_index('number', inplace=True)
matrix_distance = pd.DataFrame(squareform(pdist(df_data, metric=haversine)), index=df_data.index, columns=df_data.index)
matrix_list = np.array(matrix_distance)
#print (matrix_distance) # create table of distances between addresses including headers
#print (matrix_list) # converting table to list of lists and exclude headers
#STEP - create time matrix data
travel_time = matrix_list / 15 * 60 # divide distance (km) by travel speed (15 km/h) and multiply by 60 to get minutes
#print (travel_time) # converting distance matrix to travel time matrix
#STEP - create time window data
# create list for each sample - couriers have to visit this address within 0-X minutes of time using a list of lists
window_range = []
for i in range(len(df_data)):
    window = [0, 240]
    window_range.append(window) # create list of list with a time window range for each address
#print (window_range)
#STEP - create demand data
# create list for each sample - all addresses demand 1 parcel except the depot
demand_range = []
for i in range(len(df_data.iloc[0:1])):
    demand_range.append(0) # the depot demands nothing
for j in range(len(df_data.iloc[1:])):
    demand_range.append(1) # every other address demands 1 parcel
#print (demand_range)
#STEP - create fleet size data # amount of vehicles in the fleet
fleet_size = 6
#print (fleet_size)
#STEP - create capacity data for each vehicle
fleet_capacity = []
for i in range(fleet_size): # capacity per vehicle
    fleet_capacity.append(20)
#print (fleet_capacity)
#STEP - create data model that stores all data for the problem
def create_data_model():
    data = {}
    data['time_matrix'] = travel_time
    data['time_windows'] = window_range
    data['num_vehicles'] = fleet_size
    data['depot'] = 0 # index of the depot
    data['demands'] = demand_range
    data['vehicle_capacities'] = fleet_capacity
    return data
#STEP - creating the solution printer
def print_solution(data, manager, routing, solution):
    """Prints solution on console."""
    print(f'Objective: {solution.ObjectiveValue()}')
    time_dimension = routing.GetDimensionOrDie('Time')
    total_time = 0
    for vehicle_id in range(data['num_vehicles']):
        index = routing.Start(vehicle_id)
        plan_output = 'Route for vehicle {}:\n'.format(vehicle_id)
        while not routing.IsEnd(index):
            time_var = time_dimension.CumulVar(index)
            plan_output += '{0} Time({1},{2}) -> '.format(
                manager.IndexToNode(index), solution.Min(time_var),
                solution.Max(time_var))
            index = solution.Value(routing.NextVar(index))
        time_var = time_dimension.CumulVar(index)
        plan_output += '{0} Time({1},{2})\n'.format(manager.IndexToNode(index),
                                                    solution.Min(time_var),
                                                    solution.Max(time_var))
        plan_output += 'Time of the route: {}min\n'.format(
            solution.Min(time_var))
        print(plan_output)
        total_time += solution.Min(time_var)
    print('Total time of all routes: {}min'.format(total_time))
#STEP - create the VRP solver
def main():
    # instantiate the data problem
    data = create_data_model()
    # create the routing index manager
    manager = pywrapcp.RoutingIndexManager(len(data['time_matrix']),
                                           data['num_vehicles'], data['depot'])
    # create routing model
    routing = pywrapcp.RoutingModel(manager)
    #STEP - create demand callback and dimension for capacity
    # create and register a transit callback
    def demand_callback(from_index):
        """Returns the demand of the node."""
        # convert from routing variable Index to demands NodeIndex
        from_node = manager.IndexToNode(from_index)
        return data['demands'][from_node]
    demand_callback_index = routing.RegisterUnaryTransitCallback(
        demand_callback)
    routing.AddDimensionWithVehicleCapacity(
        demand_callback_index,
        0, # null capacity slack
        data['vehicle_capacities'], # vehicle maximum capacities
        True, # start cumul to zero
        'Capacity')
    #STEP - create time callback
    # create and register a transit callback
    def time_callback(from_index, to_index):
        """Returns the travel time between the two nodes."""
        # convert from routing variable Index to time matrix NodeIndex
        from_node = manager.IndexToNode(from_index)
        to_node = manager.IndexToNode(to_index)
        return data['time_matrix'][from_node][to_node]
    transit_callback_index = routing.RegisterTransitCallback(time_callback)
    # define cost of each Arc (costs in terms of travel time)
    routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)
    # STEP - create a dimension for the travel time (TIMEWINDOW) - dimension keeps track of quantities that accumulate over a vehicles route
    # add time windows constraint
    time = 'Time'
    routing.AddDimension(
        transit_callback_index,
        2, # allow waiting time (does not have an influence in this model)
        100000, # maximum total route length in minutes per vehicle (does not have an influence because of capacity constraint)
        False, # do not force start cumul to zero
        time)
    time_dimension = routing.GetDimensionOrDie(time)
    # add time window constraints for each location except depot
    for location_idx, time_window in enumerate(data['time_windows']):
        if location_idx == data['depot']:
            continue
        index = manager.NodeToIndex(location_idx)
        time_dimension.CumulVar(index).SetRange(time_window[0], time_window[1])
    # add time window constraint for each vehicle start node
    depot_idx = data['depot']
    for vehicle_id in range(data['num_vehicles']):
        index = routing.Start(vehicle_id)
        time_dimension.CumulVar(index).SetRange(
            data['time_windows'][depot_idx][0],
            data['time_windows'][depot_idx][1])
    #STEP - instantiate route start and end times to produce feasible times
    for i in range(data['num_vehicles']):
        routing.AddVariableMinimizedByFinalizer(
            time_dimension.CumulVar(routing.Start(i)))
        routing.AddVariableMinimizedByFinalizer(
            time_dimension.CumulVar(routing.End(i)))
    #STEP - setting default search parameters and a heuristic method for finding the first solution
    search_parameters = pywrapcp.DefaultRoutingSearchParameters()
    # assumption: the first-solution strategy used in the OR-Tools guide
    search_parameters.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)
    #STEP - solve the problem with the search parameters and print solution
    solution = routing.SolveWithParameters(search_parameters)
    if solution:
        print_solution(data, manager, routing, solution)

if __name__ == '__main__':
    main()
See @Mizux's answer, which goes under the hood of the solver to put a cost on the sum of all vehicle route lengths.
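For the reporting half of the question (the total time over all routes and how many addresses were left out), a minimal plain-Python sketch is shown here. It does not use OR-Tools and does not enforce the limit inside the solver; it only post-processes routes that have already been extracted from a solution. The function name `summarize_routes`, the 5-node time matrix, the routes, and the 80-minute cap are all hypothetical values chosen for illustration.

```python
def summarize_routes(routes, time_matrix, depot=0):
    """Return (total_time, unvisited) for routes given as lists of node indices.

    Each route is expected to start and end at the depot.
    """
    total_time = 0
    visited = {depot}
    for route in routes:
        # add up the travel time of every consecutive pair of stops
        for a, b in zip(route, route[1:]):
            total_time += time_matrix[a][b]
        visited.update(route)
    unvisited = sorted(set(range(len(time_matrix))) - visited)
    return total_time, unvisited


# Hypothetical 5-node instance (node 0 is the depot), times in minutes.
time_matrix = [
    [0, 10, 15, 20, 25],
    [10, 0, 5, 12, 18],
    [15, 5, 0, 8, 14],
    [20, 12, 8, 0, 6],
    [25, 18, 14, 6, 0],
]
routes = [[0, 1, 2, 0], [0, 3, 0]]   # node 4 is not on any route
max_total_time = 80                  # hypothetical cap on the sum of all routes

total_time, unvisited = summarize_routes(routes, time_matrix)
print('Total time of all routes: {}min'.format(total_time))
print('Within cap:', total_time <= max_total_time)
print('Addresses not visited: {} {}'.format(len(unvisited), unvisited))
```

In the posted model, the routes would come from walking `routing.NextVar` as `print_solution` already does, so the same bookkeeping could be added there.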

Converge on Best Combination of Elements

You have $10,000 to invest in stocks. You are given a list of 200 stocks, and are told to select 8 of those stocks to buy, and also indicate how many of those stocks you want to buy. You cannot spend more than $2,500 on a single stock alone, and each stock has its own price ranging from $100 to $1000. You cannot buy a fraction of a stock, only whole numbers. Each stock also has a value attached to it indicating how profitable it is. This is an arbitrary number from 0-100 that serves as a simple rating system.
The end goal is to list the optimal selection of 8 stocks, and indicate the best quantity of each of those stocks to buy without going over the $2,500 limit for each stock.
• I'm not asking for investment advice, I chose stocks because it acts as a good metaphor for the actual problem I'm trying to solve.
• Seems like what I'm looking at is a more complex version of the 0/1 Knapsack problem:
• No, this isn't homework.
Here is lightly tested code for solving your problem exactly in time that is polynomial in the amount of money available, the number of stocks that you have, and the maximum amount of stock that you can buy.
#! /usr/bin/env python
from collections import namedtuple

Stock = namedtuple('Stock', ['id', 'price', 'profit'])

def optimize (stocks, money=10000, max_stocks=8, max_per_stock=2500):
    Investment = namedtuple('investment', ['profit', 'stock', 'quantity', 'previous_investment'])
    investment_transitions = []
    last_investments = {money: Investment(0, None, None, None)}
    for _ in range(max_stocks):
        next_investments = {}
        investment_transitions.append([last_investments, next_investments])
        last_investments = next_investments

    def prioritize(stock):
        # This puts the best profit/price, as a ratio, first.
        val = [-(stock.profit + 0.0)/stock.price, stock.price,]
        return val

    for stock in sorted(stocks, key=prioritize):
        # We reverse transitions so we have not yet added the stock to the
        # old investments when we add it to the new investments.
        for transition in reversed(investment_transitions):
            old_t = transition[0]
            new_t = transition[1]
            # items() instead of the Python 2 only iteritems()
            for avail, invest in old_t.items():
                for i in range(int(min(avail, max_per_stock)/stock.price)):
                    quantity = i+1
                    new_avail = avail - quantity*stock.price
                    new_profit = invest.profit + quantity*stock.profit
                    if new_avail not in new_t or new_t[new_avail].profit < new_profit:
                        new_t[new_avail] = Investment(new_profit, stock, quantity, invest)

    best_investment = investment_transitions[0][0][money]
    for transition in investment_transitions:
        for invest in transition[1].values():
            if best_investment.profit < invest.profit:
                best_investment = invest

    purchase = {}
    while best_investment.stock is not None:
        purchase[best_investment.stock] = best_investment.quantity
        best_investment = best_investment.previous_investment
    return purchase

optimize([Stock('A', 100, 10), Stock('B', 1040, 160)])
And here it is with the tiny optimization of deleting investments once we see that continuing to add stocks to it cannot improve. This will probably run orders of magnitude faster than the old code with your data.
#! /usr/bin/env python
from collections import namedtuple

Stock = namedtuple('Stock', ['id', 'price', 'profit'])

def optimize (stocks, money=10000, max_stocks=8, max_per_stock=2500):
    Investment = namedtuple('investment', ['profit', 'stock', 'quantity', 'previous_investment'])
    investment_transitions = []
    last_investments = {money: Investment(0, None, None, None)}
    for _ in range(max_stocks):
        next_investments = {}
        investment_transitions.append([last_investments, next_investments])
        last_investments = next_investments

    def prioritize(stock):
        # This puts the best profit/price, as a ratio, first.
        val = [-(stock.profit + 0.0)/stock.price, stock.price,]
        return val

    best_investment = investment_transitions[0][0][money]
    for stock in sorted(stocks, key=prioritize):
        profit_ratio = (stock.profit + 0.0) / stock.price
        # We reverse transitions so we have not yet added the stock to the
        # old investments when we add it to the new investments.
        for transition in reversed(investment_transitions):
            old_t = transition[0]
            new_t = transition[1]
            # list(): entries are deleted from old_t while we loop over it
            for avail, invest in list(old_t.items()):
                if avail * profit_ratio + invest.profit <= best_investment.profit:
                    # We cannot possibly improve with this or any other stock.
                    del old_t[avail]
                    continue
                for i in range(int(min(avail, max_per_stock)/stock.price)):
                    quantity = i+1
                    new_avail = avail - quantity*stock.price
                    new_profit = invest.profit + quantity*stock.profit
                    if new_avail not in new_t or new_t[new_avail].profit < new_profit:
                        new_invest = Investment(new_profit, stock, quantity, invest)
                        new_t[new_avail] = new_invest
                        if best_investment.profit < new_invest.profit:
                            best_investment = new_invest

    purchase = {}
    while best_investment.stock is not None:
        purchase[best_investment.stock] = best_investment.quantity
        best_investment = best_investment.previous_investment
    return purchase
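Since the code above is described as lightly tested, a brute-force reference can help when checking it on tiny inputs. The sketch below is not part of the answer's method: `brute_force` is a hypothetical helper that takes stocks as plain `(id, price, profit)` tuples and enumerates every quantity combination, so it is only usable for a handful of stocks. It applies the same three limits as the question (total budget, number of distinct stocks, spend per stock).

```python
from itertools import product


def brute_force(stocks, money=10000, max_stocks=8, max_per_stock=2500):
    """Exhaustively search quantities for a tiny list of (id, price, profit)."""
    # largest quantity of each stock allowed by the per-stock and total budget
    ranges = [range(int(min(money, max_per_stock) // price) + 1)
              for _, price, _ in stocks]
    best_profit, best_purchase = 0, {}
    for quantities in product(*ranges):
        cost = sum(q * price for q, (_, price, _) in zip(quantities, stocks))
        distinct = sum(1 for q in quantities if q > 0)
        if cost > money or distinct > max_stocks:
            continue
        profit = sum(q * gain for q, (_, _, gain) in zip(quantities, stocks))
        if profit > best_profit:
            best_profit = profit
            best_purchase = {sid: q for q, (sid, _, _) in zip(quantities, stocks) if q > 0}
    return best_profit, best_purchase


best_profit, best_purchase = brute_force([('A', 100, 10), ('B', 1040, 160)])
print(best_profit, best_purchase)
```

For the two-stock example used in the answer, both stocks can be bought up to their per-stock limit (25 units of A and 2 units of B) without exhausting the budget, which gives a quick sanity value to compare `optimize` against.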

Error in setting max features parameter in Isolation Forest algorithm using sklearn

I'm trying to train a dataset with 357 features using Isolation Forest sklearn implementation. I can successfully train and get results when the max features variable is set to 1.0 (the default value).
However when max features is set to 2, it gives the following error:
ValueError: Number of features of the model must match the input.
Model n_features is 2 and input n_features is 357
It also gives the same error when the feature count is 1 (int) and not 1.0 (float).
My understanding was that when the feature count is 2 (int), two features should be considered when creating each tree. Is this wrong? How can I change the max features parameter?
The code is as follows:
from sklearn.ensemble import IsolationForest

def isolation_forest_imp(dataset):
    estimators = 10
    samples = 100
    features = 2
    contamination = 0.1
    bootstrap = False
    random_state = None
    verbosity = 0
    estimator = IsolationForest(n_estimators=estimators, max_samples=samples, contamination=contamination,
                                max_features=features,
                                bootstrap=bootstrap, random_state=random_state, verbose=verbosity)
    model = estimator.fit(dataset)
In the documentation it states:
max_features : int or float, optional (default=1.0)
The number of features to draw from X to train each base estimator.
- If int, then draw `max_features` features.
- If float, then draw `max_features * X.shape[1]` features.
So, 2 should mean take two features and 1.0 should mean take all of the features, 0.5 take half and so on, from what I understand.
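The documented rule can be written out as a small function to make the int versus float distinction concrete. This is a plain-Python sketch of the rule as quoted from the documentation, not scikit-learn's own code; the name `n_features_drawn` is made up, and truncating the float case with `int()` is an assumption about rounding.

```python
def n_features_drawn(max_features, n_features):
    """Number of features drawn per base estimator under the documented rule."""
    if isinstance(max_features, bool) or not isinstance(max_features, (int, float)):
        raise TypeError('max_features must be an int or a float')
    if isinstance(max_features, int):
        count = max_features                     # int: draw exactly that many
    else:
        count = int(max_features * n_features)   # float: a fraction of the columns
    if not 0 < count <= n_features:
        raise ValueError('max_features must select between 1 and n_features features')
    return count


# 357 is the number of features in the question's dataset.
for setting in (2, 1, 1.0, 0.5):
    print(repr(setting), '->', n_features_drawn(setting, 357))
```

This is why `1` and `1.0` behave so differently: the int asks for a single feature, while the float asks for all 357.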
I think this could be a bug, since, taking a look in IsolationForest's fit:
# Isolation Forest inherits from BaseBagging
# and when _fit is called, BaseBagging takes care of the features correctly
super(IsolationForest, self)._fit(X, y, max_samples,
# however, after _fit, decision_function is called with X (the whole sample), without taking max_features into account
self.threshold_ = -sp.stats.scoreatpercentile(
-self.decision_function(X), 100. * (1. - self.contamination))
# when the decision function _validate_X_predict is called, with X unmodified,
# it calls the base estimator's (dt) _validate_X_predict with the whole X
X = self.estimators_[0]._validate_X_predict(X, check_input=True)
# from
def _validate_X_predict(self, X, check_input):
    """Validate X whenever one tries to predict, apply, predict_proba"""
    if self.tree_ is None:
        raise NotFittedError("Estimator not fitted, "
                             "call `fit` before exploiting the model.")
    if check_input:
        X = check_array(X, dtype=DTYPE, accept_sparse="csr")
        if issparse(X) and (X.indices.dtype != np.intc or
                            X.indptr.dtype != np.intc):
            raise ValueError("No support for np.int64 index based "
                             "sparse matrices")
    # so, this check fails because X is the original X, not with the max_features applied
    n_features = X.shape[1]
    if self.n_features_ != n_features:
        raise ValueError("Number of features of the model must "
                         "match the input. Model n_features is %s and "
                         "input n_features is %s "
                         % (self.n_features_, n_features))
    return X
So, I am not sure how you can handle this. One option might be to work out the fraction that corresponds to just the two features you need, although I am not sure it will work as expected.
Note: I am using scikit-learn v0.18.
Edit: as @Vivek Kumar commented, this is a known issue, and upgrading to 0.20 should do the trick.
