I am using pyfmi to run simulations with EnergyPlus. I noticed that initializing the individual EnergyPlus models takes quite some time, so I am looking for a way to initialize the models in parallel. I tried the Python library multiprocessing without success. If it matters, I am on Ubuntu 16.10 and use Python 3.6.
Here is what I want to get done in serial:
fmus = {}
for id in id_list:
    chdir(fmu_path+str(id))
    fmus[id] = load_fmu('f_' + str(id)+'.fmu',fmu_path+str(id))
    fmus[id].initialize(start_time,final_time)
The result is a dictionary with the ids as keys and the models as values: {id1:FMUModelCS1,id2:FMUModelCS1}
The purpose is to call the models later by their key and run simulations.
Here is my attempt with multiprocessing:
def ep_intialization(id,start_time,final_time):
    chdir(fmu_path+str(id))
    model = load_fmu('f_' + str(id)+'.fmu',fmu_path+str(id))
    model.initialize(start_time,final_time)
    return {id:model}
data = ((id,start_time,final_time) for id in id_list)
if __name__ == '__main__':
    pool = Pool(processes=cpus)
    pool.starmap(ep_intialization, data)
    pool.close()
    pool.join()
I can see the model processes in my system monitor, but then the script raises an error because the models are not picklable:
MaybeEncodingError: Error sending result: '[{id2: <pyfmi.fmi.FMUModelCS1 object at 0x561eaf851188>}]'. Reason: 'TypeError('self._fmu,self.callBackFunctions,self.callbacks,self.context,self.variable_list cannot be converted to a Python object for pickling',)'
But I cannot imagine that there is no way to initialize the models in parallel. Frameworks or libraries other than threading/multiprocessing are also welcome.
I saw this answer but it seems that it focuses on the simulations after initialization.
The answer below the one you refer to seems to explain what the problem with multiprocessing and FMU instantiation is.
I tried pathos, as suggested in this answer, but ran into the same problem:
from pyfmi import load_fmu
from multiprocessing import Pool
from os import chdir
from pathos.multiprocessing import Pool
def ep_intialization(id):
    chdir('folder' + str(id))
    model = load_fmu('BouncingBall.fmu')
    model.initialize(0,10)
    return {id:model}
id_list = [1,2]
cpus = 2
data = ((id) for id in id_list)
pool = Pool(cpus)
out = pool.map(ep_intialization, data)
This gives:
MaybeEncodingError: Error sending result: '[{1: <pyfmi.fmi.FMUModelME2 object at 0x564e0c529290>}]'. Reason: 'TypeError('self._context,self._fmu,self.callBackFunctions,self.callbacks cannot be converted to a Python object for pickling',)'
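Both attempts fail at the same point: the worker has to pickle the initialized model to send it back to the parent process, and the FMU wrapper holds native handles that cannot be pickled. One possible way around this, sketched here under several assumptions, is to never send the model anywhere: start one long-lived process per id, let it load and initialize its own model, and talk to it over a pipe using only picklable messages. FakeModel is a hypothetical stand-in for the object returned by load_fmu, the helper names (worker, start_workers, step, stop_workers) are made up for illustration, and the fork start method assumes a POSIX system such as Ubuntu.

```python
import multiprocessing as mp

# Assumption: the "fork" start method (POSIX only) keeps the sketch short;
# other start methods need an `if __name__ == '__main__':` guard.
ctx = mp.get_context("fork")


class FakeModel:
    """Hypothetical stand-in for the unpicklable object returned by load_fmu."""

    def __init__(self, model_id):
        self.model_id = model_id
        self.time = None

    def initialize(self, start_time, final_time):
        self.time = start_time

    def do_step(self, step_size):
        self.time += step_size
        return self.time


def worker(conn, model_id, start_time, final_time):
    # The model is created and initialized inside the child and never leaves it.
    model = FakeModel(model_id)  # real code would chdir and call load_fmu here
    model.initialize(start_time, final_time)
    conn.send("ready")
    while True:
        step_size = conn.recv()
        if step_size is None:  # None is the shutdown signal
            break
        conn.send(model.do_step(step_size))  # only plain floats cross the pipe
    conn.close()


def start_workers(id_list, start_time, final_time):
    workers = {}
    for model_id in id_list:
        parent_conn, child_conn = ctx.Pipe()
        proc = ctx.Process(target=worker,
                           args=(child_conn, model_id, start_time, final_time))
        proc.start()  # all processes initialize their models concurrently
        workers[model_id] = (proc, parent_conn)
    for _, conn in workers.values():
        conn.recv()  # block until every model reports "ready"
    return workers


def step(workers, model_id, step_size):
    _, conn = workers[model_id]
    conn.send(step_size)
    return conn.recv()


def stop_workers(workers):
    for proc, conn in workers.values():
        conn.send(None)
        proc.join()
```

With this layout the dictionary maps each id to a process handle and a connection instead of to the model itself, so later calls go through step(workers, some_id, step_size) rather than through the model object.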
Here is another idea:
I suppose the instantiation is slow because EnergyPlus links plenty of libraries into the FMU. If the components you are modelling all have the same interface (input, output, parameters), you can probably use a single FMU with an additional parameter that switches between the models.
This would be much more efficient: You would only have to instantiate a single FMU and could call it in parallel with different parameters and inputs.
Example:
I have never worked with EnergyPlus, but maybe the following example will illustrate the approach:
You have three variants of a building and you are only interested in the total heat flux over the entire surface area of the building as a function of the "weather" (whatever that means; possibly a lot of variables).
Put all three buildings into a single EnergyPlus model and build an if or case clause around them (pseudo code):
if (id_building == 1) {
    [model the building one]
} elseif (id_building == 2) {
    [model the building two]
}
[...]
Define the "weather" or whatever you need as an input variable for the FMU and define id_building also as a parameter. Define the overall heat flux as output variable.
This would allow you to choose the building before starting the simulation.
The two requirements are:
EnergyPlus syntax allows if or case structures.
All your models work with the same interface (in our example, weather as the input variable and a flux as the output variable).
There is a dirty workaround for the second requirement: Just define all the variables all your models need and only use what you need in the respective if block.
I need to process a couple of boolean options, and I am trying to do it like it is usually done in C:
DICT = 0x000020000
FILTER = 0x000040000
HIGH = 0x000080000
KEEP = 0x000100000
NEXT = 0x000200000
I can now assign arbitrary options to an Integer variable and test for them:
action if (opts & (HIGH|KEEP)) != 0
But this looks ugly and gets hard to read. I would prefer writing it like
action if opts.have HIGH|KEEP
This would require adding a have method to the Integer class.
The question now is: where would I do that, in order to keep this method contained to the module where those options are used and the classes that include this module? I don't think it's a good idea to add it globally, as somebody might define another have somewhere.
Or, are there better approaches, for the general task or for the given use-case? Adding a separate Options class looks like overkill - or should I?
You can use anybits?:
action if opts.anybits?(HIGH|KEEP)
The method returns true if any bits from the given mask are set in the receiver, and false otherwise.
From my reading of the LightGBM documentation, one is supposed to define categorical features when constructing the Dataset. So I have the following code:
cats=['C1', 'C2']
d_train = lgb.Dataset(X, label=y, categorical_feature=cats)
However, I received the following warning message:
/app/anaconda3/anaconda3/lib/python3.7/site-packages/lightgbm/basic.py:1243: UserWarning: Using categorical_feature in Dataset.
  warnings.warn('Using categorical_feature in Dataset.')
Why did I get the warning message?
I presume that you get this warning in a call to lgb.train. This function also has an argument categorical_feature, and its default value is 'auto', which means taking categorical columns from pandas.DataFrame (documentation). The warning, which is emitted at this line, indicates that, even though lgb.train has requested that categorical features be identified automatically, LightGBM will use the features specified in the dataset instead.
To avoid the warning, you can give the same argument categorical_feature to both lgb.Dataset and lgb.train. Alternatively, you can construct the dataset with categorical_feature=None and only specify the categorical features in lgb.train.
As user andrey-popov described, you can use the categorical_feature parameter of lgb.train to get rid of this warning.
Below is a simple example of how you could do it:
# Define categorical features
cat_feats = ['item_id', 'dept_id', 'store_id',
             'cat_id', 'state_id', 'event_name_1',
             'event_type_1', 'event_name_2', 'event_type_2']
...
# Define the datasets with the categorical_feature parameter
train_data = lgb.Dataset(X.loc[train_idx],
                         Y.loc[train_idx],
                         categorical_feature=cat_feats,
                         free_raw_data=False)
valid_data = lgb.Dataset(X.loc[valid_idx],
                         Y.loc[valid_idx],
                         categorical_feature=cat_feats,
                         free_raw_data=False)
# And train using the categorical_feature parameter
lgb.train(lgb_params,
          train_data,
          valid_sets=[valid_data],
          verbose_eval=20,
          categorical_feature=cat_feats,
          num_boost_round=1200)
This is less an answer to the OP and more an answer for people who use the sklearn API and encounter this issue.
If you are using the sklearn API, especially one of the cross_val methods from sklearn, there are two solutions you could consider.
Sklearn API solution
A solution that worked for me was to cast categorical fields into the category datatype in pandas.
If you are using a pandas DataFrame, LightGBM should automatically treat those columns as categorical. From the documentation:
integer codes will be extracted from pandas categoricals in the Python-package
It would make sense for this to be the equivalent in the sklearn API to setting categoricals in the Dataset object.
But keep in mind that LightGBM does not officially support virtually any of the non-core parameters for sklearn API, and they say so explicitly:
**kwargs is not supported in sklearn, it may cause unexpected issues.
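As a minimal sketch of the cast described above (the DataFrame contents are made up for illustration; only the column names C1 and C2 come from the question):

```python
import pandas as pd

# Hypothetical toy data; C1 and C2 are the categorical columns from the question.
df = pd.DataFrame({
    "C1": ["a", "b", "a"],
    "C2": [1, 2, 1],
    "x": [0.1, 0.2, 0.3],
})

cats = ["C1", "C2"]
for col in cats:
    # Convert each column to the pandas 'category' dtype.
    df[col] = df[col].astype("category")

# Each categorical column now carries integer codes alongside its labels.
codes_c1 = list(df["C1"].cat.codes)
```

A frame prepared like this can then be passed to the estimator's fit method as usual. Whether the categorical columns are actually picked up depends on the LightGBM version, so it is worth verifying on your setup.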
Adaptive Solution
The other, more sure-fire solution to being able to use methods like cross_val_predict and such is to just create your own wrapper class that implements the core Dataset/Train under the hood but exposes a fit/predict interface for the cv methods to latch onto. That way you get the full functionality of lightGBM with only a little bit of rolling your own code.
The code below sketches what this could look like.
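import lightgbm as ltb  # alias assumed from the calls below

class LGBMSKLWrapper:
    def __init__(self, categorical_variables, params):
        self.categorical_variables = categorical_variables
        self.params = params
        self.model = None

    def fit(self, X, y):
        # Build a native Dataset so the categorical features are set explicitly
        my_dataset = ltb.Dataset(X, y, categorical_feature=self.categorical_variables)
        self.model = ltb.train(params=self.params, train_set=my_dataset)
        return self  # sklearn convention: fit returns the estimator

    def predict(self, X):
        return self.model.predict(X)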
The above lets you load up your parameters when you create the object, and then passes them on to train when the client calls fit.
Coming from AngularJS, I'm struggling to solve the following problem. I need a function that returns an object (let's call it A), but this object cannot be returned until all the requests contained in that function are resolved. The process should be like this:
1. The object A is downloaded from a remote server
2. Using A, we do operations over another object (B)
   - B is downloaded from the server
   - B is patched using some attributes from A
3. Using A and the result of B we do operations over a third object, C
   - C is downloaded from the server
   - C is patched using some attributes from A and B
4. After B and C are processed, the function must return A
I'd like to understand how to do something like this using rxjs, but with Angular 6 most of the examples around the internet seem to be deprecated, and the tutorials out there are not really helping me. And I cannot modify the backend to make this a bit more elegant. Thanks a lot.
Consider the following Observables:
const sourceA = httpClient.get(/*...*/);
const sourceB = httpClient.get(/*...*/);
const sourceC = httpClient.get(/*...*/);
where httpClient is an instance of Angular's HttpClient.
The sequence of the operations you described may look as follows:
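// Operators used below (RxJS 6 import path)
import { map, switchMap } from 'rxjs/operators';

const A = sourceA.pipe(
  switchMap(a => sourceB.pipe(
    map(b => {
      // do some operation using a and b.
      // Return both a and b in an array, but you can
      // also return them in an object if you wish.
      return [a, b];
    })
  )),
  // Destructure the pair so that a and b are in scope again.
  switchMap(([a, b]) => sourceC.pipe(
    map(c => {
      // do some operations using a, b, and/or c.
      return a;
    })
  ))
);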
Now you just need to subscribe to A:
A.subscribe(a => console.log(a));
You can read about RxJs operators here.
Well, first of all, it appears to me that this function-call, as described, would be somehow expected to block the calling process until all of the specified events have occurred – which of course is unreasonable in JavaScript.
Therefore, first of all, I believe that your function should require, as its perhaps-only parameter, a callback that will be invoked when everything has finally taken place.
Now – as to "how to handle steps 1, 2, and 3 elegantly" ... what immediately comes to mind is the notion of a finite-state machine (FSM) algorithm.
Let's say that your function-call causes a new "request" to be placed on some request-table queue, and, if necessary, a timer-request (set to go off in 1 millisecond) to service that queue. (This entry will contain, among other things, a reference to your callback.) Let's assume also that the request is given a random-string "nonce" that will serve to uniquely identify it: this will be passed to the various external requests and must be included in their corresponding replies.
The FSM idea is that the request will have a state attribute, such as DOWNLOADING_FROM_B, B_DOWNLOADS_COMPLETE, DOWNLOADING_FROM_C, C_REQUESTS_COMPLETE, and so on, such that each and every callback that plays a part in this fully asynchronous process will (1) be able to locate a request entry by its nonce, and then (2) unambiguously "know what to do next," and "what new state (if any) to assign," based solely upon examination of the entry's state.
For instance, when the state reaches C_REQUESTS_COMPLETE, it would be time to invoke the callback that you originally provided, and to delete the request-table entry.
You can easily map-out all of the "state transitions" that might occur in an arbitrarily-complex scenario (what states can lead to what states, and what to do when they do), whether or not you actually create a data-structure to represent that so-called "state table," although sometimes it is even-more elegant(!) when you do. (Possibly-messy decision logic is simply pushed to a simple table-lookup.)
This is, of course, a classic algorithm that is applicable to – and, has been used in – "every programming language under the sun." (Lots of hardware devices use it, too.)
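Since the idea is language-independent, here is a rough sketch of it in Python rather than JavaScript. Everything in it is illustrative: the event names, the reduced set of states (DOWNLOADING_FROM_A is added for the first step), and the helpers submit and handle_reply are assumptions, not part of any framework. The transition table is the "state table" mentioned above, and the replies would in practice arrive from asynchronous HTTP callbacks.

```python
import secrets
from enum import Enum, auto


class State(Enum):
    DOWNLOADING_FROM_A = auto()
    DOWNLOADING_FROM_B = auto()
    DOWNLOADING_FROM_C = auto()
    C_REQUESTS_COMPLETE = auto()


# State table: (current state, event) -> next state.
TRANSITIONS = {
    (State.DOWNLOADING_FROM_A, "a_loaded"): State.DOWNLOADING_FROM_B,
    (State.DOWNLOADING_FROM_B, "b_loaded"): State.DOWNLOADING_FROM_C,
    (State.DOWNLOADING_FROM_C, "c_loaded"): State.C_REQUESTS_COMPLETE,
}

# Request table, keyed by nonce.
requests = {}


def submit(callback):
    """Register a new request and return the nonce that identifies it."""
    nonce = secrets.token_hex(8)
    requests[nonce] = {
        "state": State.DOWNLOADING_FROM_A,
        "callback": callback,
        "data": {},
    }
    return nonce


def handle_reply(nonce, event, payload):
    """Advance the request identified by nonce according to the state table."""
    entry = requests[nonce]
    next_state = TRANSITIONS.get((entry["state"], event))
    if next_state is None:
        raise ValueError(f"unexpected event {event!r} in state {entry['state']}")
    entry["data"][event] = payload
    entry["state"] = next_state
    if next_state is State.C_REQUESTS_COMPLETE:
        # Final state: drop the table entry and invoke the original callback.
        del requests[nonce]
        entry["callback"](entry["data"])
```

A caller would pass its callback to submit, hand the returned nonce to the outgoing requests, and have each reply handler call handle_reply with that nonce. The decision logic stays in the table, so adding a step means adding a state and one table row.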
I have a bottle.py app that should load some data, parts of which get served depending on specific routes. (This is similar to memcached in principle, except the data isn't that big and I don't want the extra complexity.) I can load the data into global variables which are accessible from each function I write, but this seems less clean. Is there any way to load some data into a Bottle() instance during initialization?
You can do it by using bottle.default_app.
Here's a simple example.
main.py (used sample code from http://bottlepy.org/docs/dev/)
import bottle
from bottle import route, run, template
app = bottle.default_app()
app.myvar = "Hello there!" # add new variable to app
@app.route('/hello/<name>')
def index(name='World'):
    return template('<b>Hello {{name}}</b>!', name=name)
run(app, host='localhost', port=8080)
some_handler.py
import bottle
def show_var_from_app():
    var_from_app = bottle.default_app().myvar
    return var_from_app
For a project of mine, I want to analyse around 2 TB of Protobuf objects. I want to consume these objects in a Pig script via the "elephant bird" library. However, it is not entirely clear to me how to write a file to HDFS so that it can be consumed by the ProtobufPigLoader class.
This is what I have:
Pig script:
register ../fs-c/lib/*.jar -- this includes the elephant bird library
register ../fs-c/*.jar
raw_data = load 'hdfs://XXX/fsc-data2/XXX*' using com.twitter.elephantbird.pig.load.ProtobufPigLoader('de.pc2.dedup.fschunk.pig.PigProtocol.File');
Import tool (parts of it):
def getWriter(filenamePath: Path) : ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File] = {
  val conf = new Configuration()
  val fs = FileSystem.get(filenamePath.toUri(), conf)
  val os = fs.create(filenamePath, true)
  val writer = new ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File](os, classOf[de.pc2.dedup.fschunk.pig.PigProtocol.File])
  return writer
}
val writer = getWriter(new Path(filename))
val builder = de.pc2.dedup.fschunk.pig.PigProtocol.File.newBuilder()
writer.write(builder.build)
writer.finish()
writer.close()
The import tool runs fine. I had a few problems with the ProtobufPigLoader because I cannot use the hadoop-lzo compression library, and without a fix (see here) ProtobufPigLoader isn't working. The actual problem is that DUMP raw_data; returns "Unable to open iterator for alias raw_data" and ILLUSTRATE raw_data; returns "No (valid) input data found!".
To me, it looks like the data written by ProtobufBlockWriter cannot be read by the ProtobufPigLoader. But what should I use instead? How do I write data to HDFS from an external tool so that it can be processed by ProtobufPigLoader?
Alternative question: What should I use instead? How do I write fairly large objects to Hadoop so that they can be consumed with Pig? The objects are not very complex, but each contains a large list of sub-objects (a repeated field in Protobuf).
I want to avoid any text format or JSON because they are simply too large for my data. I expect that they would bloat the data by a factor of 2 or 3 (lots of integers, and lots of byte strings that I would need to encode as Base64).
I want to avoid normalizing the data so that the id of the main object is attached to each of the subobjects (this is what is done now) because this also blows up the space consumption and makes joins necessary in the later processing.
Updates:
I didn't generate the protobuf loader classes; I use the reflection-based loader instead.
The protobuf classes are in a jar that is registered. DESCRIBE correctly shows the types.