I am trying to evaluate my dynamic topic models. The models were generated with the gensim wrappers.
Are there any functions like perplexity or topic coherence, equivalent to those for "normal" topic modeling?
Yes, there is topic coherence and perplexity for the gensim wrappers:
# Compute the coherence score
from gensim.models.coherencemodel import CoherenceModel

coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=processed_docs, dictionary=dictionary, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)
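For perplexity, the Mallet wrapper itself does not expose a method, but a native gensim LdaModel does (gensim also ships a malletmodel2ldamodel converter). A minimal sketch, assuming lda_model is a gensim LdaModel and corpus is the matching bag-of-words corpus:
# Lower bound on the per-word log likelihood; 'lda_model' and 'corpus'
# are assumed to already exist
print('Perplexity: ', lda_model.log_perplexity(corpus))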
You can check out this article for more information.
I hope this helps :)
I need to use the interaction variable feature of multiclass classification in H2OGradientBoostingEstimator in H2O in Python, but I am not sure which parameter to use or how to use it. Can anyone please help me out with this?
Currently, I am using the code below:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

pros_gbm = H2OGradientBoostingEstimator(nfolds=0, seed=1234,
                                        keep_cross_validation_predictions=False,
                                        ntrees=10, max_depth=3, learn_rate=0.01,
                                        distribution='multinomial')
hist_gbm = pros_gbm.train(x=predictors, y=target, training_frame=hf_train,
                          validation_frame=hf_test, verbose=True)
GBM inherently creates interactions. You can extract information about feature interactions using the .feature_interaction() extractor method (for an H2O Model). More information is provided in the user guide and the Python docs.
If you want to explicitly add a new column that is the interaction between two numerics, you could create that manually by multiplying the two (or more) columns together to get a new interaction column.
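A minimal sketch of that manual approach, assuming hf_train has numeric columns 'x1' and 'x2' (hypothetical names):
# H2OFrame arithmetic is element-wise, so this creates a new interaction
# column, which can then be added to the predictor list
hf_train['x1_x2'] = hf_train['x1'] * hf_train['x2']
predictors = predictors + ['x1_x2']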
For categorical interactions, there's also the h2o.interaction() method in Python, which creates interaction columns in the data (prior to sending it to the GBM or any other algorithm).
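A hedged sketch of h2o.interaction(), assuming hf_train has categorical columns 'cat_a' and 'cat_b' (hypothetical names):
import h2o

# Build pairwise interaction columns between the two factors and attach
# them to the training frame
inter = h2o.interaction(hf_train, factors=['cat_a', 'cat_b'],
                        pairwise=True, max_factors=100, min_occurrence=1)
hf_train = hf_train.cbind(inter)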
After running AutoML (classification with 3 classes), I can see a list of models as follows:
model_id mean_per_class_error
StackedEnsemble_BestOfFamily_0_AutoML_20180420_174925 0.262355
StackedEnsemble_AllModels_0_AutoML_20180420_174925 0.262355
XRT_0_AutoML_20180420_174925 0.266606
DRF_0_AutoML_20180420_174925 0.278428
GLM_grid_0_AutoML_20180420_174925_model_0 0.442917
However, mean_per_class_error is not a good metric for my case, where the classes are imbalanced (one class has a very small population). How can I fetch the details of the non-leader models and calculate other metrics? Thanks.
python version: 3.6.0
h2o version: 3.18.0.5
Actually, I just figured this out myself (assuming aml is the H2O AutoML object after training):
for m in aml.leaderboard.as_data_frame()['model_id']:
    print(m)
    print(h2o.get_model(m))
You can also grab a particular model you're interested in with the following line:
model6 = h2o.get_model(aml.leaderboard.as_data_frame()['model_id'][6])
where 6 is the index number of the model in the leaderboard.
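Once you have a model object, you can score it on a held-out frame and pull whichever metrics suit the imbalanced setting; a minimal sketch, assuming a test H2OFrame called hf_test (hypothetical name):
# Metrics other than mean_per_class_error, e.g. logloss and the
# per-class confusion matrix
perf = model6.model_performance(test_data=hf_test)
print(perf.logloss())
print(perf.confusion_matrix())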
I intend to use a trained xgboost model with tree_method='exact' in a SparkML pipeline, so I need to use XGBoost4J-Spark; however, the documentation says "Distributed and external memory version only support approximate algorithm." (https://xgboost.readthedocs.io/en/latest//parameter.html). Is there any way to work around this?
Alternatively, I could train the model with the C-based xgboost and somehow convert the trained model to an XGBoostEstimator, which is a SparkML estimator and seamless to integrate into a SparkML pipeline. Has anyone come across such a converter?
I don't mind running on a single node instead of a cluster as I can afford to wait.
Any insights are appreciated.
So there is this way:
import ml.dmlc.xgboost4j.scala.XGBoost
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel

// Load the booster trained with the C-based xgboost
val xgb1 = XGBoost.loadModel("xgb1")
// Wrap it in a Spark ML regression model
val xgbSpark = new XGBoostRegressionModel(xgb1)
where xgb1 is the model trained with the C-based xgboost. There is a problem, however: their predictions don't match. I have reported the issue on the GitHub repo: https://github.com/dmlc/xgboost/issues/3190
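For completeness, a hedged sketch of how the "xgb1" file might be produced on the Python side with the C-based xgboost (the exact tree method is fine on a single node; the feature matrix X and labels y are assumed):
import xgboost as xgb

# Train with the exact greedy algorithm, then save in the binary format
# that XGBoost4J's XGBoost.loadModel can read
dtrain = xgb.DMatrix(X, label=y)
params = {'tree_method': 'exact', 'objective': 'reg:squarederror'}
bst = xgb.train(params, dtrain, num_boost_round=100)
bst.save_model('xgb1')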
I do a search. I narrow by field A. I narrow by field B. I get results that include burlap AND sack. What I want is to get results that include burlap OR sack.
sqs = sqs.narrow(fieldA='burlap')
sqs = sqs.narrow(fieldB='sack')
You can do some level of OR narrowing within a single field by putting the backend's OR syntax in the value (note that a bare Python expression like ('burlap' or 'tweed' or 'plastic') just evaluates to 'burlap'):
sqs = sqs.narrow(fieldA='(burlap OR tweed OR plastic)')
sqs = sqs.narrow(fieldB='sack')
But you still end up with results with burlap AND sack. An alternative to this method is the following, but it is not ideal since it seems to be slow on large data sets:
sqs = sqs.filter_or(fieldA='burlap')
sqs = sqs.filter_or(fieldB='sack')
Where is Daniel Lindsay when you need him?
YMMV -- the docs (http://django-haystack.readthedocs.org/en/latest/searchqueryset_api.html#narrow) point out that this method is not portable between backends and that the syntax depends on the backend. The example in that section even shows a Lucene-looking SearchQuerySet().narrow('title:smoothie').
In the source, it looks like haystack pretty trustingly passes whatever you give as the narrow argument straight to the backend. You didn't say which backend you are using, but maybe something like this would get you the fq you want in Solr:
sqs = sqs.narrow('fieldA:burlap OR fieldB:sack')
filter_or is a different animal than narrow, at least with Solr. filter_or will add that clause to the main query, resulting in a different set of results, different scoring, etc. narrow will create a filter query (fq), which is instead used to filter your original results (shocking, right?) and can be cached, which can help performance if you're going to be using that filter a lot.
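A hedged side-by-side sketch of the two approaches, assuming the Solr backend:
from haystack.query import SearchQuerySet

# narrow: becomes a cached Solr filter query (fq); does not affect scoring
or_narrowed = SearchQuerySet().narrow('fieldA:burlap OR fieldB:sack')

# filter_or: ORs the clauses into the main query, changing both the
# result set and the scoring
or_filtered = SearchQuerySet().filter_or(fieldA='burlap').filter_or(fieldB='sack')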
D'oh, I typed all that stuff and still don't know where Daniel Lindsay is.
I have a Product model holding deals from stores (another model, Store) across the whole city. Now, if someone selects a particular store, I want my view to display deals from all stores in geographically nearby areas of that store (say within a range of 3 miles).
One way would be finding all deals on a zipcode basis, but I'm wondering if there is a better way to do this. Maybe some gem..
Thanks.
Use the geokit gem: http://geokit.rubyforge.org/ . Example:
Store.find(:all, :origin =>[37.792,-122.393], :within=>10)
It works with relational databases. However, it is not optimized the way geospatial databases are.
What you're looking for is a spatial database. You can achieve this with Postgres via PostGIS. I'd also highly recommend using GeoServer or MapServer as a front-end to PostGIS. You're going to want to do some serious reading on GIS in general. This is not a topic to cover in a single answer. You may want to spend some time poking around the OSGeo site.
If you're feeling trendy, you can use MongoDB's spatial indexes. This is probably what I would recommend if you're looking for a quick fix. FourSquare actually runs entirely on MongoDB's spatial functionality; it's what they use to find people close by. So with Mongo you could find nearby deals with something like:
db.deals.find({
  loc: {
    $near: [YOUR_X, YOUR_Y],
    $maxDistance: DEAL_DISTANCE
  }
});
This will return all deals that are within DEAL_DISTANCE of your coordinates.