How to get word2vec training loss in Gensim from pretrained models? - gensim

I have some pre-trained word2vec models and I'd like to evaluate them using the same corpus. Is there a way I could get the raw training loss given a model dump file and the corpus in memory?

The training-loss reporting of gensim's Word2Vec (& related models) is a newish feature that doesn't quite yet work the way most people expect.
For example, at least through gensim 3.7.1 (January 2019), you can only retrieve the total loss since the last call to train() (summed across multiple epochs). Some pending changes may eventually change that.
The loss-tallying is only done if requested when the model is created, via the compute_loss parameter. So if the model wasn't initially configured with this setting, there will be no loss data inside it about prior training.
You could presumably tamper with the loaded model, setting w2v_model.compute_loss = True, so that further calls to train() (with the same or new data) would collect loss data. However, note that such training will also be updating the model with respect to the current data.
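A minimal sketch of that re-training approach, assuming gensim's usual Word2Vec API; the model path and the corpus variable are placeholders, and remember that this continues training and therefore changes the model:
Python example (sketch):
from gensim.models import Word2Vec
w2v_model = Word2Vec.load("pretrained_w2v.model")          # placeholder path to the saved model
w2v_model.compute_loss = True                               # enable loss tallying before further training
# corpus: an in-memory list of tokenized sentences (placeholder); this call updates the vectors
w2v_model.train(corpus, total_examples=len(corpus), epochs=1)
print(w2v_model.get_latest_training_loss())                 # total loss tallied during this train() call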
You could also look at the score() method, available for some model modes, which reports a loss-related number for batches of new texts, without changing the model. It may essentially work as a way to assess whether new texts "seem like" the original training data. See the method docs, including links to the motivating academic paper and an example notebook, for more info:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score
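A rough sketch of the score() route; it only works for models trained with hierarchical-softmax skip-gram (sg=1, hs=1), and the names below are placeholders:
Python example (sketch):
from gensim.models import Word2Vec
w2v_model = Word2Vec.load("pretrained_w2v.model")           # placeholder path
# sentences: the in-memory corpus as a list of token lists (placeholder)
log_probs = w2v_model.score(sentences, total_sentences=len(sentences))
print(sum(log_probs))                                        # higher (less negative) means "more like" the training data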

Related

In Google Cloud Platform, «Start training» is disabled

I am training a model in GCP's AutoML Natural Language Entity extraction.
I have 50+ annotations for each label but still can't start training a model.
Take a look at a screenshot of the
train section. The Start training button remains grey and cannot be selected.
Looking at the screenshot, it seems that you are talking about training an AutoML Entity Extraction model. If so, this issue seems to be the same as in "Unable to start training my GCP AutoML Entity Extraction model on Web UI".
There are thus a couple of reasons that may result in this behavior:
Your dataset is located in a specific region (e.g. "EU"), in which case you need to specify the proper endpoint, as shown in the official documentation (see the sketch below).
You might need to increase the number of "Training items per label" to 100 at minimum (see Natural Language limits).
From the aforementioned post, the solution seems to be the first one.
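If the first reason applies and you are also working with the API programmatically, the endpoint override looks roughly like this. This is only a sketch based on the google-cloud-automl Python client; the project ID is a placeholder and is not part of the question:
Python example (sketch):
from google.cloud import automl
# EU-hosted datasets need the EU regional endpoint instead of the global default
client = automl.AutoMlClient(client_options={"api_endpoint": "eu-automl.googleapis.com:443"})
parent = "projects/my-project-id/locations/eu"   # placeholder project ID; "eu" matches the dataset region
# subsequent dataset and training calls made through this client now target the EU endpoint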

What would be the preferred data structure for a graphql resolver to return the counts of ratings?

View:
The JSON that I return from the GraphQL field resolver.
The JSON is the direct response of the SQL query:
The front-end dev says the following:
To which I responded:
I feel this is a case of over-engineering, and that it is the view's responsibility to convert and use the data according to its own needs. I understand the need to cache the counts to optimize the query response, but that has nothing to do with the array-vs-object format. The front-end dev wasn't convinced by my response, thinks this will cause a performance issue that I'm failing to understand, and has asked me to seek opinions from the Stack Overflow community. Your enlightenment on this would be appreciated. I'd learn something out of it maybe. :)
For a small amount of data like this, rendering is not an issue on the view side. As for the object and array structures, I would go with the object in these scenarios, because ratings are currently displayed by stars: what if in the future they are rendered as graphs or some other kind of representation?
In those cases the change would be required on both sides, because you have tightly coupled your view to server-side logic. If you go with an object, the change will only be on the view side and the server will stay independent of the view.
Not only do objects give you a decoupled environment, but if in the future you want to add some extra information, it will be easy for both sides. Currently it is only number-specific; what if in the future the view needs more information based on some profile, like users or areas? Eventually you would need to convert to such structures anyway, so it would be more fruitful to go with objects.
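To make the array-vs-object distinction concrete, here is a small illustrative sketch; the field names and counts are hypothetical, not taken from the actual query or SQL response:
Python example (sketch):
# Array shape: position and ordering carry meaning, so the view is coupled to it
ratings_as_array = [
    {"rating": 5, "count": 120},
    {"rating": 4, "count": 80},
    {"rating": 3, "count": 12},
]
# Object shape: keyed by rating value; extra fields (e.g. percentages) can be
# added later without breaking consumers that only read "count"
ratings_as_object = {
    "5": {"count": 120},
    "4": {"count": 80},
    "3": {"count": 12},
}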
On the front-end side, if you want to optimize, you can use a memoized function with dependencies, or normalize the objects, so that the view does not have to reprocess that much.
As for the reducer logic that converts the data into an array, that is somewhat wasted work: assuming find() does a linear scan and your data is sorted, find() ends up doing roughly 15 comparisons for five entries in the best case. If you sort instead, it takes roughly 11, and you get more efficiency and scalability compared to a plain array scan, since most sorting methods accept a custom comparison function.
Any corrections or other options would be highly appreciated.

How to quickly prepare rasa training data

I am going to build a chatbot from scratch with Rasa. The biggest difficulty right now is how to automate producing training data. The training data consists of nlu.md and stories.md.
I have tried rasa-nlu-trainer and Chatito, but there is still a lot of manual work. If there are tens of thousands of utterances in the future, how do I label the data so that it meets the data formats of nlu.md and stories.md?
Is there an automated tool or program to do this? Thanks a lot!
Well, if you're doing anything ML-related, your data is the most important thing the model has to learn from. Because we want the model to learn from that data, we create the data and then train the model with it. What you're asking for is something that would somehow create that data for you. It's precisely because nothing like that exists that we create datasets to train the model on ourselves, so that it can learn from them. So, if you automate the data-creation process, what do you expect the model to learn?
So, you can't create the data automatically because if that were possible, we would already have had Artificial General Intelligence (AGI) by now.
But if your goal is to just format the data then you can just write a script for that.
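For instance, if your labelled utterances already live in some structured file, a short script can emit the nlu.md layout used by Rasa's Markdown training-data format. This is only a sketch: the CSV file name and its intent/text columns are assumptions, not part of the question:
Python example (sketch):
import csv
from collections import defaultdict

examples = defaultdict(list)
with open("labelled_utterances.csv", newline="") as f:   # placeholder input with "intent" and "text" columns
    for row in csv.DictReader(f):
        examples[row["intent"]].append(row["text"])

with open("nlu.md", "w") as out:
    for intent, texts in examples.items():
        out.write("## intent:" + intent + "\n")          # one section per intent
        for text in texts:
            out.write("- " + text + "\n")                # one training example per line
        out.write("\n")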

h2o subset to "bestofFamily"

AutoML builds two stacked ensembles: one that includes "all" models, and another built from the subset that is the "best of family".
Is there a way to programmatically (not manually) save the component models and the stacked-ensemble aggregator to disk, so that the "best of family" ensemble, treated as a standalone black box, can be stored, reloaded, and used without requiring the literally 1000 less valuable learners to exist in the same space?
If so, how do I do that?
While AutoML is running, everything stays in memory (nothing is saved to disk unless you explicitly save one of the models, or use an option that saves an object to disk).
If you just want the "Best of Family" stacked ensemble, all you have to do is save that binary model. When you save a stacked ensemble, it saves all the required pieces (base models and meta model) for you. Then you can re-load later for use with another H2O cluster when you're ready to make predictions (just make sure, if you are saving a binary model, that you can use the same version of H2O later on).
Python Example:
bestoffamily = h2o.get_model('StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135')
h2o.save_model(bestoffamily, path = "/home/users/me/mymodel")
R Example:
bestoffamily <- h2o.getModel('StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135')
h2o.saveModel(bestoffamily, path = "/home/users/me/mymodel")
Later on, you re-load the stacked ensemble into memory using h2o.load_model() in Python or h2o.loadModel() in R.
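For example, in Python (a sketch; the path is whatever h2o.save_model() returned earlier, and test_frame is a placeholder H2OFrame of new data):
Python example (sketch):
import h2o
h2o.init()
# load_model() expects the exact file path that save_model() returned
ensemble = h2o.load_model("/home/users/me/mymodel/StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135")
preds = ensemble.predict(test_frame)    # test_frame: an H2OFrame of new data (placeholder)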
Alternatively, instead of using an H2O binary model, which requires an H2O cluster to be running at prediction time, you can use a MOJO model (different model format). It's a bit more work to use MOJOs, though they are faster and designed for production use. If you want to save a MOJO model instead, then you can use h2o.save_mojo() in Python or h2o.saveMojo() in R.

Tensorflow memory issue

I have a problem with TensorFlow.
I need to create several models (e.g. neural networks), but after computing the parameters of each model I will create new ones, and I won't need the previous models anymore.
TensorFlow does not seem to recognize which models I am still using and which ones no longer have any references, and I don't know how I should delete the previous models. As a result, memory keeps growing until the system kills my process, which is obviously something I would like to avoid.
How do you think I should deal with this problem? What's the correct way to 'delete' the previous models?
thanks in advance,
Samuele
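A hedged sketch of one common workaround for this pattern (not taken from this thread): drop your own references to a finished model and clear Keras/TensorFlow's global graph and session state between models so earlier models can be garbage-collected. The names model_configs, build_model, store_what_you_need, and the training data are placeholders:
Python example (sketch):
import gc
import tensorflow as tf

for config in model_configs:              # placeholder: iterable of hyperparameter settings
    model = build_model(config)           # placeholder: builds and compiles a Keras model
    model.fit(x_train, y_train)           # placeholder training data
    store_what_you_need(model)            # e.g. save weights or metrics, then discard the model
    del model                             # drop the Python reference
    tf.keras.backend.clear_session()      # release the global graph/session state Keras keeps
    gc.collect()                          # prompt collection of the now-unreferenced objects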
