While training the Stanford sentiment model on a given dataset, we are using the command:
java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz
Is it possible to train without using dev.txt? What is its significance?
Yes, it is possible to train without development data. The development data is used during training to evaluate the model on "unseen" data, as an estimate of how well the final model will generalize to new inputs.
If you don't provide any development data, you won't get any feedback during training on the performance of your model. (You can still take the saved models and test them on new data manually.)
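For the manual testing step, CoreNLP also ships an Evaluate class for sentiment models; if I remember the invocation correctly, it is along these lines:
java -mx8g edu.stanford.nlp.sentiment.Evaluate -model model.ser.gz -treebank test.txt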
Update:
Can I split the small test set into a validation set realB-v and a test set realB-t, fine-tune the model, and test on the test set realB-t? Then I swap the validation and test sets and train a new model. Can I report the average of the results from the two trainings?
Original post:
I have a pre-trained model M trained on the real dataset realA. I test it on another real dataset realB and get very poor results because realA and realB have a domain gap. Since real images like those in realB are difficult to acquire, I decided to generate synthetic images resembling realB and use these images, syntheticA, to fine-tune the model M.
I wonder if I still need a validation set? If so, should the validation set be split from syntheticA or from realB? realB is already a very small set (300 images).
In my view, a validation set is not necessary in this case. If I directly fine-tune the model and choose hyperparameters according to the accuracy on realB, it won't cause generalization problems, because the images I use for fine-tuning are all synthetic.
I'd like to hear your views. Thank you.
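For concreteness, here is a minimal sketch of the split-and-swap evaluation described in the update. The names realB_images, realB_labels, syntheticA, finetune() and evaluate() are all hypothetical stand-ins for your own data and training/eval code:
import numpy as np
from sklearn.model_selection import train_test_split

# Split the 300 realB indices into two halves
idx = np.arange(len(realB_images))
half_a, half_b = train_test_split(idx, test_size=0.5, random_state=0)

scores = []
for val_idx, test_idx in [(half_a, half_b), (half_b, half_a)]:
    # Tune hyperparameters against one half of realB...
    model = finetune(syntheticA,
                     val_images=realB_images[val_idx],
                     val_labels=realB_labels[val_idx])
    # ...and measure accuracy on the held-out half
    scores.append(evaluate(model,
                           realB_images[test_idx],
                           realB_labels[test_idx]))

# The reported number is the average over the two swapped splits
print('average accuracy:', np.mean(scores))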
As H2O models are only reusable with the same major version of H2O they were saved with, an alternative is to save the model in MOJO/POJO format. Is there a way these saved models can be reused/loaded from Python code? Or is there any way to keep the model for further development when upgrading the H2O version?
If you want to use your model for scoring via Python, you could use either h2o.mojo_predict_pandas or h2o.mojo_predict_csv. Otherwise, if you want to load a binary model that you previously saved, you will need to have compatible versions.
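For example, a minimal sketch of scoring a MOJO from plain Python with h2o.mojo_predict_pandas (the file paths here are assumptions):
import h2o
import pandas as pd

# Rows to score (hypothetical CSV path)
input_df = pd.read_csv('test_data.csv')

# Score against a saved MOJO; genmodel_jar_path must point to the
# h2o-genmodel.jar shipped with H2O (both paths are hypothetical)
predictions = h2o.mojo_predict_pandas(
    dataframe=input_df,
    mojo_zip_path='model.zip',
    genmodel_jar_path='h2o-genmodel.jar',
)
print(predictions.head())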
Outside of H2O-3 you can look into pyjnius as Tom recommended: https://github.com/kivy/pyjnius
Another alternative is to use pysparkling, if you only need it for scoring:
from pysparkling.ml import H2OMOJOModel

# Assumes an existing SparkSession named `spark` with Sparkling Water on
# the classpath; test_data_path and mojo_path are your own paths
# Load test data to predict
df = spark.read.parquet(test_data_path)
# Load the MOJO model
mojo = H2OMOJOModel.createFromMojo(mojo_path)
# Make predictions
predictions = mojo.transform(df)
# Show predictions with the ground truth (y_true and y_pred)
predictions.select('your_target_column', 'prediction').show()
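Since MOJOs are designed to be independent of the H2O version that produced them, this route sidesteps the binary-model compatibility problem when you upgrade H2O.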
I intend to use the trained xgboost model with tree_method='exact' in a SparkML pipeline, so I need to use XGBoost4J-Spark; however, the documentation says "Distributed and external memory version only support approximate algorithm." (https://xgboost.readthedocs.io/en/latest//parameter.html). Is there any way to work around this?
Alternatively, I could train the model with the C-based xgboost and somehow convert the trained model to an XGBoostEstimator, which is a SparkML estimator and seamless to integrate into a SparkML pipeline. Has anyone come across such a converter?
I don't mind running on a single node instead of a cluster, as I can afford to wait.
Any insights are appreciated.
So there is this way:
import ml.dmlc.xgboost4j.scala.XGBoost
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel

// Load the booster trained with the C-based xgboost
val xgb1 = XGBoost.loadModel("xgb1")
// Wrap it in a Spark ML model so it can be used in a pipeline
val xgbSpark = new XGBoostRegressionModel(xgb1)
where xgb1 is the model trained with the C-based xgboost. There is a problem, however: their predictions don't match. I have reported the issue on the GitHub repo: https://github.com/dmlc/xgboost/issues/3190
I am using pycrfsuite now.
I know how to save a CRF model after training:
import pycrfsuite

crf_trainer = pycrfsuite.Trainer()
# (training sequences are added beforehand with crf_trainer.append(xseq, yseq))
crf_trainer.train('crf.crfsuite')
So when I want to tag, I use this code:
crf_tagger = pycrfsuite.Tagger()
crf_tagger.open('crf.crfsuite')
But I don't know how to reload the saved model for further training.
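As far as I know, pycrfsuite's Trainer has no warm-start or model-loading API, so the usual workaround is to keep the original training sequences and retrain from scratch on the combined old and new data. A minimal sketch, where old_sequences and new_sequences are hypothetical lists of (features, labels) pairs:
import pycrfsuite

trainer = pycrfsuite.Trainer(verbose=False)
# Re-append everything: the sequences used before plus the new ones
for xseq, yseq in old_sequences + new_sequences:
    trainer.append(xseq, yseq)
# Retrain and overwrite the saved model file
trainer.train('crf.crfsuite')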
I have a Product model holding deals for stores (another model, Store) across a whole city. Now, if someone selects a particular store, I want my view to display the deals of all stores in geographically nearby areas (say, within a range of 3 miles).
One way would be to find all deals on a zipcode basis, but I'm wondering if there is a better way to do this. Maybe some gem?
Thanks.
Use the geokit gem: http://geokit.rubyforge.org/ . Example:
Store.find(:all, :origin =>[37.792,-122.393], :within=>10)
It works with relational databases. However, it is not optimized the way geospatial databases are.
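Note that Store.find(:all, ...) is the old Rails 2 style; with more recent geokit-rails versions the equivalent is a scope, something like Store.within(10, :origin => [37.792, -122.393]) (quoting from memory, so double-check against the geokit-rails README).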
What you're looking for is a spatial database. You can achieve this with Postgres via PostGIS. I'd also highly recommend using GeoServer or MapServer as a front-end to PostGIS. You're going to want to do some serious reading on GIS in general. This is not a topic to cover in a single answer. You may want to spend some time poking around the OSGeo site.
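For a flavor of what the PostGIS route looks like, here is a minimal sketch of a nearby-deals query driven from Python with psycopg2; the connection string, tables, and column names are all assumptions:
import psycopg2

# Hypothetical schema: a stores table with a geography(Point) column
# named location, and a deals table referencing stores via store_id
conn = psycopg2.connect('dbname=deals_db')  # hypothetical DSN
cur = conn.cursor()

selected_store_id = 42  # hypothetical store id

# ST_DWithin on geography columns measures distance in metres;
# 3 miles is roughly 4828 metres
cur.execute("""
    SELECT d.id, d.title
    FROM deals d
    JOIN stores s ON s.id = d.store_id
    WHERE ST_DWithin(
        s.location,
        (SELECT location FROM stores WHERE id = %s),
        4828)
""", (selected_store_id,))
print(cur.fetchall())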
If you're feeling trendy, you can use MongoDB's spatial indexes. This is probably what I would recommend if you're looking for a quick fix. FourSquare actually runs entirely on MongoDB's spatial functionality. It's what they use to find people close-by. So with Mongo you could find nearby deals with something like
db.deals.find({
  loc: {
    $near: [YOUR_X, YOUR_Y],
    $maxDistance: DEAL_DISTANCE
  }
});
This will return all deals that are within DEAL_DISTANCE of your coordinates.
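Note that $near requires a geospatial index on the loc field, created with something like db.deals.ensureIndex({loc: "2d"}).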