I couldn't find any information about the feature selection mechanism in the Stanford NLP text classifier.
Does the ColumnDataClassifier perform any feature selection by default? The following line is from the output for the 20 Newsgroups data:
numFeatures (Phi(X) types): 245343 [CLASS, 2-SW-et, 2-SW-stop, 2-SW-somebody, 2-SW-organizating, ...]
With 245343 features, I don't see how this tool can be so fast and use less than 2 GB of memory. When I try to train a model in WEKA on the same data set with fewer features (45000), WEKA uses 8 GB of memory and takes forever to train the model.
Read a set of training examples from a file, and return the data in a featurized form. If feature selection is asked for, the returned featurized form is after feature selection has been applied.
I am training a model in GCP's AutoML Natural Language Entity Extraction.
I have 50+ annotations for each label but still can't start training a model.
Take a look at a screenshot of the train section: the Start training button remains grey and cannot be selected.
Looking at the screenshot, it seems you may be talking about training an AutoML Entity Extraction model. This issue then seems to be the same as in Unable to start training my GCP AutoML Entity Extraction model on Web UI.
There are thus a couple of reasons that may result in this behavior:
Your dataset is located in a specific region (e.g. "EU") and you need to specify the proper endpoint, as shown in the official documentation.
You might need to increase the number of "Training items per label" to 100 at minimum (see Natural Language limits).
From the aforementioned post, the solution seems to be the first one.
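As an illustration of the first point, here is a hedged Python sketch of pointing the AutoML client at the EU endpoint (the endpoint value, project ID, and location string are assumptions based on the public documentation, not taken from the linked post):

from google.cloud import automl

# Assumed EU regional endpoint; for a dataset located in the EU, the client
# needs to be created against this endpoint rather than the default one.
client = automl.AutoMlClient(client_options={"api_endpoint": "eu-automl.googleapis.com:443"})

# Hypothetical project ID; "eu" is the multi-region location used here as an example.
parent = "projects/my-project/locations/eu"
for dataset in client.list_datasets(parent=parent):
    print(dataset.name)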
I am going to work on some projects dealing with entity deduplication. One or more datasets may contain duplicate entities. In practice, an entity may be represented by a name, address, country, email, or social media ID, each in different forms. My goal is to identify possible duplicates based on different weights for the different pieces of entity info. I am looking for a library that is open source and preferably written in Java.
As I need to process millions of records, I have to be concerned with scaling and performance; in particular, the runtime should not be on the order of n^2. Among my findings below, some use index-based search with Lucene and some use data grouping (a rough sketch of the grouping/blocking idea is shown right after this paragraph).
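To make the data-grouping approach concrete, here is a minimal, library-agnostic Python sketch (with made-up records and a simplistic blocking key) of how blocking avoids comparing every pair:

from collections import defaultdict
from itertools import combinations

# Made-up records for illustration only.
records = [
    {"id": 1, "name": "John Smith",  "email": "john@example.com"},
    {"id": 2, "name": "Jon Smith",   "email": "JOHN@example.com"},
    {"id": 3, "name": "Alice Brown", "email": "alice@example.org"},
]

# Group records by a normalized blocking key; the detailed (weighted) comparison
# then only happens within each group, instead of across all n^2 pairs.
blocks = defaultdict(list)
for rec in records:
    blocks[rec["email"].lower()].append(rec)

candidate_pairs = [pair for block in blocks.values() for pair in combinations(block, 2)]
print(candidate_pairs)  # only the two "John Smith" variants are compared in detail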
Please share your suggestions on which one is better.
Here are my findings so far:
Duke (Java/Lucene)
Comments: Uses genetic algorithms; it's flexible. There haven't been any updates since 2016.
YannBrrd/elasticsearch-entity-resolution (extension of Duke)
Comments: There haven't been any updates since 2017. Also, I need to check whether it's compatible with the latest ES and Lucene.
dedupeio/dedupe (Python)
Comments: Uses the data grouping method, but it's written in Python.
JedAIToolkit (Java)
Comments: Uses the data grouping method.
Zentity (Elasticsearch Plugin)
Comments: It's a good one. I need to check whether it supports deduplication; so far the documentation only talks about entity identity resolution.
Python Record Linkage Toolkit Documentation
Comments: It is in Python.
bakdata/dedupe (Java)
Comments: No clear documentation on how to use it.
I was wondering if anybody knows of any others. Please also share pros and cons of the above.
I have some pre-trained word2vec models and I'd like to evaluate them using the same corpus. Is there a way I could get the raw training loss, given a model dump file and the corpus in memory?
The training-loss reporting of gensim's Word2Vec (& related models) is a newish feature that doesn't quite yet work the way most people expect.
For example, at least through gensim 3.7.1 (January 2019), you can only retrieve the total loss since the last call to train() (tallied across that call's multiple epochs). Some pending changes may eventually change that.
The loss-tallying is only done if requested when the model is created, via the compute_loss parameter. So if the model wasn't initially configured with this setting there will be no loss data inside it about prior training.
You could presumably tamper with the loaded model, setting w2v_model.compute_loss = True, so that further calls to train() (with the same or new data) would collect loss data. However, note that such training will also be updating the model with respect to the current data.
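As a minimal sketch (using a made-up toy corpus and a hypothetical saved-model path), enabling loss tallying and reading it back might look like this:

from gensim.models import Word2Vec

toy_corpus = [["hello", "world"], ["another", "toy", "sentence"]]

# Loss is only tallied if requested at model-creation/training time.
model = Word2Vec(toy_corpus, compute_loss=True, min_count=1)
print(model.get_latest_training_loss())  # running total since the last train() call

# For a model loaded from disk (hypothetical file name), you could flip the flag
# and train further -- but note this also updates the model's weights.
loaded = Word2Vec.load("my_w2v_model")
loaded.compute_loss = True
loaded.train(toy_corpus, total_examples=len(toy_corpus), epochs=loaded.epochs)
print(loaded.get_latest_training_loss())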
You could also look at the score() method, available for some model modes, which reports a loss-related number for batches of new texts, without changing the model. It may essentially work as a way to assess whether new texts "seem like" the original training data. See the method docs, including links to the motivating academic paper and an example notebook, for more info:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score
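A hedged sketch of score(), assuming a model trained in a mode that supports it (here skip-gram with hierarchical softmax) and a couple of made-up probe texts:

from gensim.models import Word2Vec

toy_corpus = [["hello", "world"], ["another", "toy", "sentence"]]

# score() requires hierarchical softmax; train a small skip-gram/hs model.
hs_model = Word2Vec(toy_corpus, sg=1, hs=1, negative=0, min_count=1)

probe_texts = [["hello", "world"], ["toy", "sentence", "hello"]]
log_scores = hs_model.score(probe_texts, total_sentences=len(probe_texts))
print(log_scores)  # log-likelihood-style scores; the model itself is not modified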
AutoML makes two stacked-ensemble learners: one that includes "all" models, and another built from a subset that is the "best of family".
Is there any way to non-manually save the components and the stacked ensemble aggregator to disk, so that the "best of family", treated as a standalone black box, can be stored, reloaded, and used without requiring literally 1000 less valuable learners to exist in the same space?
If so, how do I do that?
While AutoML is running, everything is kept in memory (nothing is saved to disk unless you explicitly save one of the models, or use an option that saves an object to disk).
If you just want the "Best of Family" stacked ensemble, all you have to do is save that binary model. When you save a stacked ensemble, it saves all the required pieces (base models and meta model) for you. Then you can re-load later for use with another H2O cluster when you're ready to make predictions (just make sure, if you are saving a binary model, that you can use the same version of H2O later on).
Python Example:
bestoffamily = h2o.get_model('StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135')
h2o.save_model(bestoffamily, path = "/home/users/me/mymodel")
R Example:
bestoffamily <- h2o.getModel('StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135')
h2o.saveModel(bestoffamily, path = "/home/users/me/mymodel")
Later on, you re-load the stacked ensemble into memory using h2o.load_model() in Python or h2o.loadModel() in R.
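For instance, a minimal Python sketch of the re-load step (the path is whatever h2o.save_model() returned; the file name here is just an assumed example, and test_frame is an H2OFrame you supply):

import h2o
h2o.init()  # connect to an H2O cluster of the same version that saved the model

loaded_ensemble = h2o.load_model("/home/users/me/mymodel/StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135")
predictions = loaded_ensemble.predict(test_frame)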
Alternatively, instead of using an H2O binary model, which requires an H2O cluster to be running at prediction time, you can use a MOJO model (different model format). It's a bit more work to use MOJOs, though they are faster and designed for production use. If you want to save a MOJO model instead, then you can use h2o.save_mojo() in Python or h2o.saveMojo() in R.
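A hedged Python sketch of the MOJO route (function availability depends on your H2O version; paths are examples):

mojo_path = h2o.save_mojo(bestoffamily, path="/home/users/me/mymodel_mojo")
# Later, the MOJO can be imported back into a running cluster for scoring:
mojo_model = h2o.import_mojo(mojo_path)
predictions = mojo_model.predict(test_frame)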
I'm using the Stanford Named Entity toolkit with social media streams. However, with such a huge number of documents/sentences, I need to improve the running-time performance of the recognizer/classifier. I was wondering what techniques I could use to solve this problem.
I should mention that I only need to recognize one class of entities: organizations.
corenlp.sh takes a -threads parameter.
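For example, a sketch of a typical invocation (the file name, annotator list, and thread count here are assumptions):

./corenlp.sh -annotators tokenize,ssplit,pos,lemma,ner -threads 8 -file tweets.txt -outputFormat json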