AutoML builds two stacked-ensemble learners: one that includes "all" models, and another, "best of family", built from a subset of them.
Is there any way, short of saving each piece manually, to save the component models and the stacked-ensemble aggregator to disk so that "best of family", treated as a standalone black box, can be stored, reloaded, and used without requiring literally 1,000 less valuable learners to exist in the same space?
If so, how do I do that?
While AutoML is running, everything stays in memory; nothing is written to disk unless you explicitly save one of the models (or another object) to disk.
If you just want the "Best of Family" stacked ensemble, all you have to do is save that binary model. When you save a stacked ensemble, it saves all the required pieces (base models and meta model) for you. Then you can re-load later for use with another H2O cluster when you're ready to make predictions (just make sure, if you are saving a binary model, that you can use the same version of H2O later on).
Python Example:
bestoffamily = h2o.get_model('StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135')
h2o.save_model(bestoffamily, path = "/home/users/me/mymodel")
R Example:
bestoffamily <- h2o.getModel('StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135')
h2o.saveModel(bestoffamily, path = "/home/users/me/mymodel")
Later on, you re-load the stacked ensemble into memory using h2o.load_model() in Python or h2o.loadModel() in R.
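For example, a minimal Python sketch (the path to load is whatever h2o.save_model() returned above, and "new_data.csv" is a placeholder):

import h2o
h2o.init()

# Re-load the saved binary ensemble (this brings back the base models and metalearner too)
loaded = h2o.load_model("/home/users/me/mymodel/StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135")

# Score new data with the re-loaded ensemble
newdata = h2o.import_file("new_data.csv")
preds = loaded.predict(newdata)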
Alternatively, instead of using an H2O binary model, which requires an H2O cluster to be running at prediction time, you can use a MOJO model (different model format). It's a bit more work to use MOJOs, though they are faster and designed for production use. If you want to save a MOJO model instead, then you can use h2o.save_mojo() in Python or h2o.saveMojo() in R.
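For example, a hedged Python sketch using the function named above (the exact function name and signature may vary between H2O versions, and the directory is a placeholder):

# Export the "Best of Family" ensemble as a MOJO zip file
mojo_path = h2o.save_mojo(bestoffamily, path="/home/users/me/mymodel_mojo")

At prediction time a MOJO is typically scored from Java via the h2o-genmodel library, so no running H2O cluster is needed.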
I have a processor that generates time-series data in JSON format. Based on the received data, I need to make a forecast using machine learning algorithms in Python and then write the new forecast values to another flow file.
The problem is that when you run such a Python script, it must perform many heavy preprocessing operations: queries to a database, creating a complex data structure, initializing forecasting models, etc.
If I use ExecuteStreamCommand, the script will be run again for every flow file. Is that true?
Can I make a Python script in NiFi that starts once and then receives flow files many times, storing the history of previously received data? Or do I need to make an HTTP service that will receive data from NiFi?
You have a few options:
Build a custom processor. This is my suggested approach. The code would need to be in Java (or Groovy, which provides a more Python-like experience) but would not have Python dependencies, etc. However, I have seen examples of this approach for ML model application (see Tim Spann's examples) and this is generally very effective. The initialization and individual flowfile trigger logic is cleanly separated, and performance is good.
Use InvokeScriptedProcessor. This will allow you to write the code in Python and separate the initialization (pre-processing, DB connections, etc., which is onScheduled in NiFi processor parlance) from the execution phase (onTrigger). Some examples exist, but I have not personally pursued this with Python specifically; a rough skeleton is sketched after this list of options. You can use Python dependencies but not "native modules" (i.e. compiled C code), as the execution engine is still Jython.
Use ExecuteStreamCommand. Not strongly recommended. As you mention, every invocation would require the preprocessing steps to occur, unless you designed your external application in such a way that it ran a long-lived "server" component and each ESC command sent data to it and returned an individual response. I don't know what your existing Python application looks like, but this would likely involve complicated changes. Tim has another example using CDSW to host and deploy the model and NiFi to send it data via HTTP to evaluate.
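To illustrate option 2, here is a rough Jython skeleton for InvokeScriptedProcessor. This is only a hedged sketch of the general scripted-processor pattern; the exact interface methods may need adjusting for your NiFi version, load_forecasting_model() is a hypothetical placeholder for your heavy initialization, and the flow file content handling is omitted.

from org.apache.nifi.processor import Processor, Relationship

class ForecastProcessor(Processor):
    def __init__(self):
        self.REL_SUCCESS = Relationship.Builder().name("success").description("ok").build()
        self.model = None
        self.history = []

    def initialize(self, context):
        pass

    def getRelationships(self):
        return set([self.REL_SUCCESS])

    def validate(self, context):
        pass

    def getPropertyDescriptors(self):
        return []

    def getIdentifier(self):
        return None

    def onScheduled(self, context):
        # Runs once when the processor is started: DB queries, building the
        # complex data structure, initializing the forecasting models, etc.
        self.model = load_forecasting_model()  # hypothetical helper

    def onTrigger(self, context, sessionFactory):
        session = sessionFactory.createSession()
        flowFile = session.get()
        if flowFile is None:
            return
        # Read the incoming JSON, append it to self.history, run the forecast
        # with self.model, and write the result to the outgoing flow file
        # (content read/write callbacks omitted for brevity).
        session.transfer(flowFile, self.REL_SUCCESS)
        session.commit()

# InvokeScriptedProcessor expects the script to expose the processor instance
# in a variable named "processor"
processor = ForecastProcessor()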
Make a custom processor that can do that; Java is more appropriate here. I believe you can do pretty much everything with Java, you just need to find the right libraries. Yes, there may be some initialization and preprocessing concerns, but all of that can be handled in the processor's init method in NiFi, which allows you to preserve the state of certain components.
In my use case I had to build a custom processor that could take in images and count the number of people in each image. For that, I had to load a deep learning model once in the init method, and afterwards the onTrigger method could use the reference to that model every time it processed an image.
I have some pre-trained word2vec model and I'd like to evaluate them using the same corpus. Is there a way I could get the raw training loss given a model dump file and the corpus in memory?
The training-loss reporting of gensim's Word2Vec (& related models) is a newish feature that doesn't quite yet work the way most people expect.
For example, at least through gensim 3.7.1 (January 2019), you can only retrieve the total loss since the last call to train(), summed across all epochs of that call. Some pending changes may eventually change that.
The loss-tallying is only done if requested when the model is created, via the compute_loss parameter. So if the model wasn't initially configured with this setting there will be no loss data inside it about prior training.
You could presumably tamper with the loaded model, setting w2v_model.compute_loss = True, so that further calls to train() (with the same or new data) would collect loss data. However, note that such training will also be updating the model with respect to the current data.
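For example, a hedged Python sketch (corpus stands for your in-memory corpus of tokenized sentences, and the filename is a placeholder):

from gensim.models import Word2Vec

# Load the pre-trained model dump
model = Word2Vec.load("pretrained_word2vec.model")

# Turn on loss tallying, then run (more) training; note that this updates the model
model.compute_loss = True
model.train(corpus, total_examples=len(corpus), epochs=1)

# Total loss accumulated across all epochs of the train() call above
print(model.get_latest_training_loss())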
You could also look at the score() method, available for some model modes, which reports a loss-related number for batches of new texts, without changing the model. It may essentially work as a way to assess whether new texts "seem like" the original training data. See the method docs, including links to the motivating academic paper and an example notebook, for more info:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score
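A hedged sketch of the score() approach (again, corpus is your in-memory corpus; as I recall this is only implemented for hierarchical-softmax skip-gram models, so check the docs linked above):

# Reports a log-probability-style score per sentence without updating the model
scores = model.score(corpus, total_sentences=len(corpus))
print(scores[:10])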
Relatively new to ML and H2O. Is there a way to do collaborative learning/training with H2O? I would prefer a way that uses the Flow UI; otherwise I would be using Python.
My use case is that new feature samples x = [a, b, c, d] would periodically come into a system where an H2O algorithm (say, running from a Java program using a MOJO) assigns a binary class. Users should be able to manually reclassify each sample as either good (0) or bad (1), at which point these samples (with their newly assigned responses) get sent back to the H2O algorithm to be used to further train it.
Thanks
The Flow UI is great for prototyping something very quickly with H2O without writing a single line of code. You can ingest the data, build the desired model, and then evaluate the results. Unfortunately, the Flow UI cannot be extended for what you are asking; Flow is limited in that respect.
For collaborative learning you can write your whole application directly in Python or R, and it will work as expected.
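As a hedged Python sketch of that kind of loop (the file names and paths are placeholders, the feature names a, b, c, d and the response come from the question, and this simply retrains from scratch on the combined data rather than doing true online learning):

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()

# Original training data plus the newly user-reclassified samples
base = h2o.import_file("training_data.csv")
feedback = h2o.import_file("relabeled_samples.csv")  # columns a, b, c, d, label
combined = base.rbind(feedback)
combined["label"] = combined["label"].asfactor()

# Re-train and export a fresh MOJO for the Java scoring program to pick up
gbm = H2OGradientBoostingEstimator()
gbm.train(x=["a", "b", "c", "d"], y="label", training_frame=combined)
gbm.download_mojo(path="/path/to/deployed/mojo")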
Using H2O Steam's prediction service for a deployed model, the default threshold that the prediction service seems to use is the max-F1 threshold. However, in my case I would like to be able to use other thresholds, as displayed by the model when built in H2O Flow (e.g. the max-F2 or max-accuracy thresholds).
Is there a way to set these thresholds in steam?
Looking at the inspector on the prediction service page, it seems that the logic for the predictor comes from a script called "predict.js".
But I can't find where these files are in the Steam launch directory (running from localhost based on these instructions); a file search in this directory for anything named "predict.js" returns nothing.
I believe for POJOs there's no way to change it, as it's hardcoded. I don't know the answer for MOJOs.
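As a workaround outside Steam, here is a hedged plain-H2O-Python sketch (the model, frames, and names are placeholders, and I have not verified this against the Steam service itself) for looking up the alternative thresholds that Flow displays and applying one yourself to the predicted probabilities:

# Look up the thresholds that Flow displays, from the model's performance metrics
perf = model.model_performance(valid)
max_f2 = perf.find_threshold_by_max_metric("f2")
max_acc = perf.find_threshold_by_max_metric("accuracy")

# Apply the chosen threshold to the class-1 probability instead of relying on
# the default max-F1 label
preds = model.predict(new_data)
preds["custom_label"] = preds["p1"] > max_f2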
Suppose I have to create functionalities A, B and C through custom coding in Drupal using hooks.
I can either combine all three of them in custom1.module or create three separate modules for them, say custom1.module, custom2.module and custom3.module.
Benefits of creating three modules:
Clean code
Easily searchable
Mutually independent
Easy to commit in multi-developer projects
Cons:
Every module entry gets stored in the database and requires a query.
To what extent does it mar the performance of the site?
Is it better to create a single large custom module file for the sake of reducing database queries or break it into different smaller ones?
This issue might be negligible for small-scale websites, but consider the case of large-scale, performance-oriented sites.
I would code it based on how often I need functions A, B, and C.
Actual example:
I made a module that had three needs:
1) Send periodic emails based on user preference. Let's call this function A.
2) Custom content made in a module. Let's call this function B.
3) Social integration. Let's call this function C.
Since function A is only called once a week, I made a separate module for it.
As for functions B and C, I put them together, as they would always be called together.
If you have problems with performance then check out this link. It's a good resource for performance improvement.
http://www.vmirgorod.name/10/11/5/tuning-drupal-performance
It lists a nice module called Boost. I have not used it myself, but I have heard good things about it.
cheers,
Vishal
Drupal .module files are all loaded with every page load, so there is very little performance to be gained or lost simply by separating functions into different .module files. If you are not using an opcode cache, you can get improved performance by creating .inc files and referencing them in the menu items in hook_menu, so that those files are only loaded when the menu items are accessed. Then infrequently called functions do not take up memory space.
File separation in general is a very small performance issue compared to how the module is designed with respect to caching, memory use, and/or database access and structure. Let the code maintenance and dependency issues drive when you do or do not create separate modules.
I'm actually interested in:
Is it better to create a single large custom module file for the sake of reducing database queries or break it into different smaller ones?
I was poking around and found a few things regarding benchmarking the database. Maybe the suggestion here is to fire up the dev version and test; check out db benchmarking.
Now, I understand that doesn't answer the question specifically, but I'd have to say it's unique to each environment. I hate to give that type of answer, but I truly believe it is. It depends on the modules installed, the versions used, hardware, and OS tunables, among many other things.