I have trained a RASA NLU model with the following config
language: en
pipeline:
- name: "pretrained_embeddings_convert"
This configuration defaults to the following list of components:
language: "en"
pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"
I have also tried the other readily available configs, like supervised_embeddings and pretrained_embeddings_spacy, as well as custom configs. All of them take 6-9 seconds of load time just to instantiate the Trainer object. Similarly, when I try to load the persisted model for inference,
interpreter = Interpreter.load('../path_to_trained_model')
it again takes roughly the same 6-9 seconds to load. Is there any way this can be mitigated, or am I doing something wrong? I want to serve these models on demand, which requires a faster load time.
supervised_embeddings (see here) is the pipeline that will have the shortest load time, because it doesn't load up any pre-trained word embeddings.
pretrained_embeddings_convert and pretrained_embeddings_spacy both load up word embeddings at start up (the ConveRT and spacy embeddings respectively), which are quite large in size and therefore take some time to load.
So if a fast load time is important, I would recommend using supervised_embeddings. However, once a model is loaded the inference time is relatively fast, so you only need to wait those few seconds once.
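If the goal is on-demand serving, one common mitigation is to pay the load cost once per process and keep the Interpreter in memory across requests rather than reloading it for every request. A minimal sketch of that pattern, assuming a long-running serving process; the cache dict and helper names are illustrative, and the import path depends on your Rasa version (rasa.nlu.model in Rasa 1.x, rasa_nlu.model in older releases):

from rasa.nlu.model import Interpreter  # or: from rasa_nlu.model import Interpreter

# Keep loaded interpreters in memory for the lifetime of the serving process.
_interpreters = {}

def get_interpreter(model_path):
    # Pay the 6-9 second load cost only the first time a model is requested.
    if model_path not in _interpreters:
        _interpreters[model_path] = Interpreter.load(model_path)
    return _interpreters[model_path]

def parse_message(model_path, text):
    # Subsequent calls reuse the cached interpreter; inference itself is fast.
    return get_interpreter(model_path).parse(text)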
PS: if you haven't already, please join our community forum, where questions like these are answered frequently.
I have the same workflow in two different environments. To validate that both workflows are identical, I feed the same input data to both. If the workflows are identical, I expect the output dataset of each workflow to be the same.
For this requirement, I cannot alter the workflow in any way (add/remove DAGs, etc.).
Which tool is best suited for this use case? I was reading up on data validation frameworks like Apache Griffin and Great Expectations. Can either of these be used for this use case? Or is there a simpler alternative?
Update: I forgot to mention that I want the validation process to be as non-interactive as possible. The Great Expectations tutorial talks about manually opening and running Jupyter notebooks, and I want to minimize manual steps like that as much as possible, if that makes sense.
Update 2:
Dataset produced by the workflow in the first environment:

Name  Value
ABC   10
DEF   20

Dataset produced by the workflow in the second environment:

Name  Value
DEF   20
ABC   10
After running validation, I want the result to report that both datasets are identical, even though the rows are in a different order.
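For illustration of the desired check only (not a recommendation for any particular validation framework): assuming both outputs can be loaded into pandas DataFrames, an order-insensitive comparison could look like the following, with the file names as placeholders.

import pandas as pd

# Placeholder paths for the two workflow outputs.
df_a = pd.read_csv("env1_output.csv")
df_b = pd.read_csv("env2_output.csv")

# Sort both datasets on all columns so the original row order is ignored,
# then compare values; this prints True for the two example datasets above.
key = sorted(df_a.columns)
a = df_a.sort_values(key).reset_index(drop=True)
b = df_b[df_a.columns].sort_values(key).reset_index(drop=True)
print(a.equals(b))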
I have a processor that generates time-series data in JSON format. Based on the received data, I need to make a forecast using machine learning algorithms in Python, and then write the new forecast values to another flow file.
The problem is that running such a Python script requires many heavy preprocessing operations: queries to a database, building a complex data structure, initializing forecasting models, etc.
If you use ExecuteStreamCommand, the script will be run anew for every flow file. Is this true?
Can I write a Python script in NiFi that starts once and then receives flow files many times, keeping a history of previously received data? Or do I need to build an HTTP service that will receive data from NiFi?
You have a few options:
Build a custom processor. This is my suggested approach. The code would need to be in Java (or Groovy, which provides a more Python-like experience) but would not have Python dependencies, etc. However, I have seen examples of this approach for ML model application (see Tim Spann's examples) and it is generally very effective. The initialization and per-flowfile trigger logic are cleanly separated, and performance is good.
Use InvokeScriptedProcessor. This will allow you to write the code in Python and separate the initialization (pre-processing, DB connections, etc., onScheduled in NiFi processor parlance) from the execution phase (onTrigger). Some examples exist, but I have not personally pursued this with Python specifically. You can use Python dependencies but not "native modules" (i.e. compiled C code), as the execution engine is still Jython.
Use ExecuteStreamCommand. Not strongly recommended. As you mention, every invocation would require the preprocessing steps to occur, unless you designed your external application to run a long-lived "server" component, with each ESC command sending data to it and getting an individual response back. I don't know what your existing Python application looks like, but this would likely involve complicated changes. Tim has another example using CDSW to host and deploy the model and NiFi to send it data via HTTP for evaluation; a minimal sketch of that long-lived service pattern follows this list.
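For the third option, the long-lived "server" component can be as small as a Python HTTP service that does the expensive initialization once at startup and then handles each request cheaply; NiFi would send it flow file content via InvokeHTTP (or from ExecuteStreamCommand). A minimal sketch, with the model loading and the forecasting logic left as placeholders:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Expensive, one-time initialization: DB queries, data structures, model setup.
# load_forecasting_model() is a placeholder for your own startup code.
MODEL = None  # e.g. MODEL = load_forecasting_model()

class ForecastHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the time-series JSON sent by NiFi.
        length = int(self.headers.get("Content-Length", 0))
        series = json.loads(self.rfile.read(length) or b"{}")
        # Placeholder forecast; replace with a call into MODEL.
        forecast = {"forecast": series.get("values", [])}
        body = json.dumps(forecast).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # The service stays up, so the initialization cost is paid only once.
    HTTPServer(("0.0.0.0", 8000), ForecastHandler).serve_forever()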
Make a custom processor that can do that; Java is more appropriate here. You can do pretty much everything with Java, you just need to find the right libraries. There may be some issues with initialization and preprocessing, but those can be handled in the processor's init method, which lets you preserve the state of certain components across flow files.
In my use case, I had to build a custom processor that takes in images and counts the number of people in each image. For that, I loaded a deep learning model once in the init method, and the onTrigger method then reuses the reference to that model every time it processes an image.
I have some pre-trained word2vec models and I'd like to evaluate them using the same corpus. Is there a way I could get the raw training loss given a model dump file and the corpus in memory?
The training-loss reporting of gensim's Word2Vec (& related models) is a newish feature that doesn't quite yet work the way most people expect.
For example, at least through gensim 3.7.1 (January 2019), you can only retrieve the total loss accumulated since the last call to train() (across its multiple epochs). Some pending changes may eventually change that.
The loss-tallying is only done if requested when the model is created, via the compute_loss parameter. So if the model wasn't initially configured with this setting there will be no loss data inside it about prior training.
You could presumably tamper with the loaded model, setting w2v_model.compute_loss = True, so that further calls to train() (with the same or new data) would collect loss data. Note, however, that such training will also be updating the model with respect to the current data.
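A rough sketch of that approach, assuming a model persisted with gensim's save() and a tokenized corpus already in memory (the model path and the toy corpus are placeholders):

from gensim.models import Word2Vec

# Placeholder: the persisted model file and an in-memory tokenized corpus.
model = Word2Vec.load("word2vec.model")
corpus = [["this", "is", "a", "sentence"], ["another", "tokenized", "sentence"]]

# Enable loss tallying on the loaded model; note that no loss was recorded
# for its original training, and this extra train() call updates the model.
model.compute_loss = True
model.train(corpus, total_examples=len(corpus), epochs=1)

print(model.get_latest_training_loss())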
You could also look at the score() method, available for some model modes, which reports a loss-related number for batches of new texts, without changing the model. It may essentially work as a way to assess whether new texts "seem like" the original training data. See the method docs, including links to the motivating academic paper and an example notebook, for more info:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score
AutoML builds two ensemble learners: one that includes "all" of the trained models, and one built from a subset, the "best of family".
Is there any way, without doing it all by hand, to save the component models and the stacked-ensemble aggregator to disk so that the "best of family" ensemble, treated as a standalone black box, can be stored, reloaded, and used without requiring literally 1000 less valuable learners to exist in the same space?
If so, how do I do that?
While AutoML is running, everything stays in memory; nothing is saved to disk unless you explicitly save one of the models (or another object) to disk.
If you just want the "Best of Family" stacked ensemble, all you have to do is save that binary model. When you save a stacked ensemble, it saves all the required pieces (base models and meta model) for you. Then you can re-load it later for use with another H2O cluster when you're ready to make predictions (just make sure, if you are saving a binary model, that you use the same version of H2O later on).
Python Example:
# Retrieve the Best of Family ensemble from the AutoML run by its model ID
bestoffamily = h2o.get_model('StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135')
# Persist the binary model (base models + meta model) to the given directory
h2o.save_model(bestoffamily, path = "/home/users/me/mymodel")
R Example:
bestoffamily <- h2o.getModel('StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135')
h2o.saveModel(bestoffamily, path = "/home/users/me/mymodel")
Later on, you re-load the stacked ensemble into memory using h2o.load_model() in Python or h2o.loadModel() in R.
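For instance, the Python side of re-loading and scoring could look like the following (the saved model path and the scoring file are placeholders):

import h2o
h2o.init()

# Re-load the persisted Best of Family ensemble; use the full path returned by h2o.save_model()
ensemble = h2o.load_model("/home/users/me/mymodel/StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135")

# Score new data with the re-loaded ensemble (placeholder file name)
newdata = h2o.import_file("new_data.csv")
predictions = ensemble.predict(newdata)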
Alternatively, instead of using an H2O binary model, which requires an H2O cluster to be running at prediction time, you can use a MOJO model (different model format). It's a bit more work to use MOJOs, though they are faster and designed for production use. If you want to save a MOJO model instead, then you can use h2o.save_mojo() in Python or h2o.saveMojo() in R.
Where can I find documentation of the Pentaho Kettle architecture? I'm looking for a short wiki, design document, blog post, anything that gives a good overview of how things work. This question is not about specific "how to" starting guides but rather about a good view of the technology and architecture.
Specific questions I have are:
How does data flow between steps? It would seem everything is in memory - am I right about this?
Is the above true about different transformations as well?
How are the Collect steps implemented?
Any specific performance guidelines for using it?
Is the ftp task reliable and performant?
Any other "Dos and Don'ts" ?
See this PDF.
How does data flow between steps? It would seem everything is in memory - am I right about this?
Data flow is row-based. Within a transformation, every step produces a 'tuple', i.e. a row with fields, and every field is a pair of data and metadata. Every step has inputs and outputs: it takes rows from its inputs, modifies them, and sends them to its outputs. In most cases all of this information is in memory, but steps read data in a streaming fashion (from JDBC or other sources), so typically only part of a stream's data is in memory at any time.
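As an analogy only (this is plain Python, not anything Kettle-specific), the row-by-row streaming between steps behaves much like chained generators: each "step" pulls one row at a time, so only a small window of the stream is in memory.

def read_rows():
    # Source step: stream rows one at a time (think of a JDBC cursor).
    for i in range(1_000_000):
        yield {"id": i, "value": i * 2}

def transform(rows):
    # Transform step: modify each row as it flows through.
    for row in rows:
        row["value_plus_one"] = row["value"] + 1
        yield row

def write_rows(rows):
    # Output step: consume the stream; only the current row is held here.
    for row in rows:
        pass  # e.g. write to a file or table

write_rows(transform(read_rows()))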
Is the above true about different transformations as well?
There is a 'job' concept and a 'transformation' concept. Everything written above is mostly true for transformations; mostly, because a transformation can contain very different steps, and some of them, like the collect steps, may try to collect all the data from a stream. Jobs are a way to perform actions that do not follow the 'streaming' concept, such as sending an email on success, loading files from the network, or executing transformations one after another.
How are the Collect steps implemented?
It depends on the particular step. Typically, as said above, collect steps may try to collect all the data from a stream, which can be a cause of OutOfMemory exceptions. If the data is too big, consider replacing the 'collect' steps with a different approach to processing the data (for example, use steps that do not collect all of it).
Any specific performance guidelines for using it?
A lot of them. They depend on which steps the transformation consists of and which data sources are used. I would rather discuss an exact scenario than give general guidelines.
Is the ftp task reliable and performant?
As far as I remember, FTP is backed by the EdtFTP implementation, and there can be some issues with those steps, such as some parameters not being saved or an HTTP/FTP proxy not working. I would say Kettle in general is reliable and performant, but for some less commonly used scenarios it may not be.
Any other "Dos and Don'ts" ?
I would say the main "Do" is to understand the tool before starting to use it intensively. As mentioned in this discussion, there is some literature on Kettle/Pentaho Data Integration; you can try searching for it on the relevant sites.
One of the advantages of Pentaho Data Integration/Kettle is its relatively big community, which you can ask about specific aspects.
http://forums.pentaho.com/
https://help.pentaho.com/Documentation