LightGBM 'Using categorical_feature in Dataset.' Warning? - lightgbm

From my reading of the LightGBM document, one is supposed to define categorical features in the Dataset method. So I have the following code:
cats=['C1', 'C2']
d_train = lgb.Dataset(X, label=y, categorical_feature=cats)
However, I received the following warning message:
/app/anaconda3/anaconda3/lib/python3.7/site-packages/lightgbm/basic.py:1243: UserWarning: Using categorical_feature in Dataset.
warnings.warn('Using categorical_feature in Dataset.')
Why did I get the warning message?

I presume that you get this warning in a call to lgb.train. This function also has a categorical_feature argument, whose default value is 'auto', meaning that categorical columns are taken from the pandas.DataFrame (documentation). The warning, which is emitted at this line, indicates that although lgb.train requested that categorical features be identified automatically, LightGBM will use the features specified in the dataset instead.
To avoid the warning, you can give the same argument categorical_feature to both lgb.Dataset and lgb.train. Alternatively, you can construct the dataset with categorical_feature=None and only specify the categorical features in lgb.train.

As user andrey-popov described, you can use lgb.train's categorical_feature parameter to get rid of this warning.
Below is a simple example of how you could do it:
# Define categorical features
cat_feats = ['item_id', 'dept_id', 'store_id',
             'cat_id', 'state_id', 'event_name_1',
             'event_type_1', 'event_name_2', 'event_type_2']
...
# Define the datasets with the categorical_feature parameter
train_data = lgb.Dataset(X.loc[train_idx],
                         Y.loc[train_idx],
                         categorical_feature=cat_feats,
                         free_raw_data=False)
valid_data = lgb.Dataset(X.loc[valid_idx],
                         Y.loc[valid_idx],
                         categorical_feature=cat_feats,
                         free_raw_data=False)
# And train using the categorical_feature parameter
lgb.train(lgb_params,
          train_data,
          valid_sets=[valid_data],
          verbose_eval=20,
          categorical_feature=cat_feats,
          num_boost_round=1200)

This is less an answer to the OP and more an answer for people who use the sklearn API and encounter this issue.
For those using the sklearn API, especially one of the cross_val methods from sklearn, there are two solutions you could consider.
Sklearn API solution
A solution that worked for me was to cast categorical fields into the category datatype in pandas.
If you are using a pandas DataFrame, LightGBM should automatically treat those columns as categorical. From the documentation:
integer codes will be extracted from pandas categoricals in the Python-package
It would make sense for this to be the equivalent in the sklearn API to setting categoricals in the Dataset object.
But keep in mind that LightGBM officially supports almost none of the non-core parameters in the sklearn API, and the documentation says so explicitly:
**kwargs is not supported in sklearn, it may cause unexpected issues.
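The cast to the category datatype described above can be sketched as follows. This is a minimal illustration, not the answerer's code: the column names C1 and C2 are borrowed from the question, the num column and the values are made up, and LightGBM itself is not called here. The integer codes printed at the end are what the quoted documentation refers to.

```python
import pandas as pd

# Hypothetical toy frame; C1 and C2 play the role of the categorical columns
df = pd.DataFrame({
    "C1": ["a", "b", "a"],
    "C2": ["x", "x", "y"],
    "num": [1.0, 2.0, 3.0],
})

cats = ["C1", "C2"]
for col in cats:
    # Cast each categorical column to the pandas category dtype
    df[col] = df[col].astype("category")

# Integer codes backing the categorical column
codes = df["C1"].cat.codes.tolist()
print(df.dtypes)
print(codes)
```

A frame prepared this way would then be passed to fit as usual, relying on the automatic detection of pandas categoricals.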
Adaptive Solution
The other, more sure-fire way to use methods like cross_val_predict is to create your own wrapper class that uses the core Dataset/train functions under the hood but exposes a fit/predict interface for the cv methods to latch onto. That way you get the full functionality of LightGBM with only a little code of your own.
The below sketches out what this could look like.
import lightgbm as lgb

class LGBMSKLWrapper:
    def __init__(self, categorical_variables, params):
        self.categorical_variables = categorical_variables
        self.params = params
        self.model = None

    def fit(self, X, y):
        # Build the core Dataset so the categorical features are set explicitly
        my_dataset = lgb.Dataset(X, y, categorical_feature=self.categorical_variables)
        self.model = lgb.train(params=self.params, train_set=my_dataset)
        return self

    def predict(self, X):
        return self.model.predict(X)
The above lets you load up your parameters when you create the object, and then passes them on to train when the client calls fit.

Related

Do AutoMl predictions not work when uploaded into Google Cloud Functions

I'm writing code that makes a prediction based on a trained AutoML multi-label classifier. The function works if I run it locally. However, as soon as I upload the same code to Cloud Functions on GCP (a process that I know usually works), it gives me this error:
TypeError: predict() takes from 1 to 2 positional arguments but 4 were given
Here is a sample of my code, taken straight from the AutoMl documentation with some slight adjustments.
def get_sentiment(content):
    """
    Returns a google cloud platform payload class containing the sentiment score given by our NLP sentiment analyser.
    :param content: STRING (UTF-8 encoded, ASCII)
    :return: <class 'google.cloud.automl.types.PredictResponse'>
    """
    options = ClientOptions(api_endpoint='automl.googleapis.com')
    prediction_client = automl_v1beta1.PredictionServiceClient(client_options=options)
    name = model_sentiment
    payload = {'text_snippet': {'content': content, 'mime_type': 'text/plain'}}
    params = {}
    request = prediction_client.predict(name, payload, params)
    return request
I have tried removing the params variable from the predict call and replacing payload with content; the only change is that I get this error:
TypeError: predict() takes from 1 to 2 positional arguments but 3 were given
Additionally, I have replaced automl_v1beta1 with automl and automl_v1. Again, both work locally but not on Google Cloud.
Thank you for any advice or help
Update: apparently there are some bugs in the latest version of AutoML, and the error was fixed by running the code on a previous version, specifically v0.9.0 in my case.
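The wording of the TypeError suggests that the predict() installed on Cloud Functions accepts at most one positional argument besides self, so name, payload and params may have to be passed as keywords there. The sketch below reproduces that message with a hypothetical stub class. The stub's signature is an assumption inferred from the error text, not the real google-cloud-automl client, so check the documentation of the installed version before relying on it.

```python
class StubPredictionClient:
    """Hypothetical stand-in, NOT the real PredictionServiceClient."""

    def predict(self, request=None, *, name=None, payload=None, params=None):
        # Assumed shape: one optional positional request, the rest keyword-only
        if request is None:
            request = {"name": name, "payload": payload, "params": params}
        return request


client = StubPredictionClient()
payload = {"text_snippet": {"content": "some text", "mime_type": "text/plain"}}

# Three positional arguments (plus self) trigger the same kind of TypeError
try:
    client.predict("model-name", payload, {})
    positional_error = None
except TypeError as exc:
    positional_error = str(exc)

# The same call with keyword arguments is accepted by the stub
keyword_result = client.predict(name="model-name", payload=payload, params={})
print(positional_error)
```

If the deployed environment installs a newer library version than the local one, pinning the version in requirements.txt (as the update above effectively did with v0.9.0) keeps both environments on the same signature.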

How to initialize PYFMI models in parallel?

I am using pyfmi to do simulations with EnergyPlus. I noticed that initializing the individual EnergyPlus models takes quite some time, so I hope to find a way to initialize the models in parallel. I tried the Python library multiprocessing with no success. If it matters, I am on Ubuntu 16.10 and use Python 3.6.
Here is what I want to get done in serial:
fmus = {}
for id in id_list:
    chdir(fmu_path+str(id))
    fmus[id] = load_fmu('f_' + str(id)+'.fmu',fmu_path+str(id))
    fmus[id].initialize(start_time,final_time)
The result is a dictionary with ids as keys and the models as values: {id1:FMUModelCS1,id2:FMUModelCS1}
The purpose is to call the models later by their key and run simulations.
Here is my attempt with multiprocessing:
def ep_intialization(id,start_time,final_time):
    chdir(fmu_path+str(id))
    model = load_fmu('f_' + str(id)+'.fmu',fmu_path+str(id))
    model.initialize(start_time,final_time)
    return {id:model}

data = ((id,start_time,final_time) for id in id_list)

if __name__ == '__main__':
    pool = Pool(processes=cpus)
    pool.starmap(ep_intialization, data)
    pool.close()
    pool.join()
I can see the processes of the models in my system monitor, but then the script raises an error because the models are not picklable:
MaybeEncodingError: Error sending result: '[{id2: <pyfmi.fmi.FMUModelCS1 object at 0x561eaf851188>}]'. Reason: 'TypeError('self._fmu,self.callBackFunctions,self.callbacks,self.context,self.variable_list cannot be converted to a Python object for pickling',)'
But I cannot imagine that there is no way to initialize the models in parallel. Other frameworks/libraries than threading/multiprocessing are also welcome.
I saw this answer but it seems that it focuses on the simulations after initialization.
The answer below the one you refer to seems to explain what the problem with multiprocessing and FMU instantiation is.
I tried with pathos suggested in this answer, but run into the same problem:
from pyfmi import load_fmu
from os import chdir
from pathos.multiprocessing import Pool

def ep_intialization(id):
    chdir('folder' + str(id))
    model = load_fmu('BouncingBall.fmu')
    model.initialize(0,10)
    return {id:model}

id_list = [1,2]
cpus = 2
data = ((id) for id in id_list)
pool = Pool(cpus)
out = pool.map(ep_intialization, data)
This gives:
MaybeEncodingError: Error sending result: '[{1: <pyfmi.fmi.FMUModelME2 object at 0x564e0c529290>}]'. Reason: 'TypeError('self._context,self._fmu,self.callBackFunctions,self.callbacks cannot be converted to a Python object for pickling',)'
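The error is raised when the worker's return value is pickled to be sent back to the parent process. The sketch below illustrates this with a hypothetical stand-in class instead of pyfmi (FakeModel, its lock attribute and the simulate method are all made up for illustration): the object holding a native handle cannot be pickled, while a plain dictionary of results can.

```python
import pickle
import threading


class FakeModel:
    """Hypothetical stand-in for an FMU model that holds an unpicklable handle."""

    def __init__(self, model_id):
        self.model_id = model_id
        self._handle = threading.Lock()  # lock objects cannot be pickled

    def simulate(self):
        # Return only plain Python data
        return {"id": self.model_id, "final_value": float(self.model_id) * 2.0}


def can_pickle(obj):
    """Report whether obj survives the pickling step used between processes."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


model = FakeModel(1)
model_ok = can_pickle({1: model})              # what the worker tried to return
result_ok = can_pickle({1: model.simulate()})  # plain results instead
print(model_ok, result_ok)
```

So a worker that loads, initializes and simulates a model and returns only the results would get past this step. Whether that fits depends on the use case, since the question asks for the initialized models themselves in the parent process.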
Here is another idea:
I suppose the instantiation is slow because EnergyPlus links plenty of libraries into the FMU. If the components you are modelling all have the same interface (input, output, parameters), you can probably use a single FMU with an additional parameter that switches between the models.
This would be much more efficient: You would only have to instantiate a single FMU and could call it in parallel with different parameters and inputs.
Example:
I have never worked with EnergyPlus, but maybe the following example will illustrate the approach:
You have three variants of a building and you are only interested in the total heat flux over the entire surface area of the buildings as a function of the "weather" (whatever that means; maybe a lot of variables).
Put all three buildings into a single EnergyPlus model and build an if or case clause around them (pseudo code):
if (id_building == 1) {
    [model the building one]
} elseif (id_building == 2) {
    [model the building two]
}
[...]
Define the "weather" or whatever you need as an input variable for the FMU and define id_building also as a parameter. Define the overall heat flux as output variable.
This would allow you to choose the building before starting the simulation.
The two requirements are:
EnergyPlus Syntax allows if or case structures.
All your models work with the same interface (in our example we have weather as in and a flux as out variables)
There is a dirty workaround for the second requirement: Just define all the variables all your models need and only use what you need in the respective if block.

How to export a sparkR model as PMML?

I'm trying to export a sparkR model as PMML.
The first approach was using the pmml library:
library(pmml)
sparkR.session()
data(iris)
df <- createDataFrame(iris)
model <- spark.kmeans(df, Sepal_Length ~ Sepal_Width, k = 4, initMode = "random")
model_pmml <- pmml(model)
The error:
Error in UseMethod("pmml"): no applicable method for 'pmml' applied to an object of class "KMeansModel"
Traceback:
1. pmml(model)
I also investigated if the toPMML method available on scala models could be used from SparkR. I've found a question that suggests it may be possible with Sparklyr, but not with SparkR.
Any ideas?
I have come to the conclusion that exporting a SparkR model is not supported. I have added a feature request for this: https://issues.apache.org/jira/browse/SPARK-21430. Please vote on the JIRA ticket if you are also looking for this functionality.

Mongoengine Django Rest Framework - Serializer Error - ReferenceField is not JSON serializable

Everything works great until the ObjectID value of the ReferenceField no longer points to a valid document. Then the ObjectID is left as the value, and json doesn't know how to serialize it.
How do I deal with invalid ReferenceFields?
E.g.
class Food(Document):
    name = StringField()
    owner = ReferenceField("Person")

class Person(Document):
    first_name = StringField()
    last_name = StringField()

...
p = Person(...)
apple = Food(name="apple", owner=p)
p.delete() # might be the wrong method, but you get the idea
At this point, attempting to fetch a list of foods via the REST API will fail with the is not JSON serializable error, since apple.owner no longer points to an owner that exists.
Since you are using DRF with mongoengine, you must be using django-rest-framework-mongoengine.
Apparently, it's a bug in django-rest-framework-mongoengine. Check this open issue on GitHub, which was reported recently about the same problem.
https://github.com/umutbozkurt/django-rest-framework-mongoengine/issues/91
One way is to write your own JSONEncoder for this. This link might help.
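A minimal sketch of such an encoder, assuming a hypothetical FakeObjectId class as a stand-in for bson.ObjectId so that the example runs without pymongo. With the real library, the isinstance check would target ObjectId instead.

```python
import json


class FakeObjectId:
    """Hypothetical stand-in for bson.ObjectId, used only for illustration."""

    def __init__(self, hex_value):
        self.hex_value = hex_value

    def __str__(self):
        return self.hex_value


class ObjectIdEncoder(json.JSONEncoder):
    def default(self, obj):
        # Called only for objects the json module cannot serialize itself
        if isinstance(obj, FakeObjectId):
            return str(obj)
        return super().default(obj)


# A dangling reference left behind as a raw id
food = {"name": "apple", "owner": FakeObjectId("507f1f77bcf86cd799439011")}
encoded = json.dumps(food, cls=ObjectIdEncoder)
print(encoded)
```

This only turns the dangling id into a string. It does not tell the client that the referenced document is gone, so that case may still need handling in the serializer.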
Another option is to use the json_util library of Pymongo. They provide explicit BSON conversion to and from json.
As per json-util docs:
This module provides two helper methods dumps and loads that wrap the
native json methods and provide explicit BSON conversion to and from
json. This allows for specialized encoding and decoding of BSON
documents into Mongo Extended JSON's Strict mode. This lets you encode
/ decode BSON documents to JSON even when they use special BSON types.

Convert multiple querysets to json in django

I asked a related question earlier today
In this instance, I have 4 queryset results:
action_count = Action.objects.filter(complete=False, onhold=False).annotate(action_count=Count('name'))
hold_count = Action.objects.filter(onhold=True, hold_criteria__isnull=False).annotate(action_count=Count('name'))
visible_tags = Tag.objects.filter(visible=True).order_by('name').filter(action__complete=False).annotate(action_count=Count('action'))
hidden_tags = Tag.objects.filter(visible=False).order_by('name').filter(action__complete=False).annotate(action_count=Count('action'))
I'd like to return them to an ajax function. I have to convert them to json, but I don't know how to include multiple querysets in the same json string.
I know this thread is old, but using simplejson to convert Django models doesn't work in many cases, such as decimals (as noted by rebus above).
As stated in the django documentation, serializer looks like the better choice.
Django’s serialization framework provides a mechanism for
“translating” Django models into other formats. Usually these other
formats will be text-based and used for sending Django data over a
wire, but it’s possible for a serializer to handle any format
(text-based or not).
Django Serialization Docs
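You can dump a dictionary of evaluated querysets with a JSON encoder. Older Django versions shipped django.utils.simplejson for this; it was removed in later releases, so the standard json module with DjangoJSONEncoder is used here. This code is untested though!
import json
from django.core.serializers.json import DjangoJSONEncoder
from django.db.models import Count
from django.http import HttpResponse

# Inside a view: .values() yields dicts and list() evaluates each queryset
data = {
    'action_count': list(Action.objects.filter(complete=False, onhold=False).annotate(action_count=Count('name')).values()),
    'hold_count': list(Action.objects.filter(onhold=True, hold_criteria__isnull=False).annotate(action_count=Count('name')).values()),
    ...
}
return HttpResponse(json.dumps(data, cls=DjangoJSONEncoder), content_type='application/json')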
I'll test and rewrite the code as necessary when I have the time to, but this should get you started.
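The idea of putting several result sets under separate keys of one dictionary can be tried without Django. In this sketch the row lists are hypothetical stand-ins for list(queryset.values()), the cost field is made up to show the decimal problem mentioned above, and encode_extra is an illustrative helper rather than a Django function.

```python
import json
from decimal import Decimal


def encode_extra(value):
    """Fallback for types the json module cannot serialize by itself."""
    if isinstance(value, Decimal):
        return str(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# Hypothetical rows, shaped like list(queryset.values())
action_rows = [{"name": "call bank", "action_count": 1, "cost": Decimal("9.50")}]
tag_rows = [{"name": "home", "action_count": 3}]

# One JSON string containing both result sets under separate keys
payload = json.dumps(
    {"action_count": action_rows, "visible_tags": tag_rows},
    default=encode_extra,
)
decoded = json.loads(payload)
print(payload)
```

On the JavaScript side, the ajax callback can then read each result set from its own key of the parsed object.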
