PyTorch distributed DataLoader

Are there any recommended ways to make PyTorch DataLoader (torch.utils.data.DataLoader) work in a distributed environment, on a single machine and across multiple machines? Can it be done without DistributedDataParallel?

Maybe you need to make your question clearer. DistributedDataParallel (DDP) is what you use to train a model in a distributed environment; this question seems to be about how to arrange dataset loading for distributed training.
First of all,
data.DataLoader works for both distributed and non-distributed training; usually there is nothing special you need to do there.
However, the sampling strategy differs between the two modes: you need to specify a sampler for the DataLoader (the sampler argument of data.DataLoader), and adopting torch.utils.data.distributed.DistributedSampler is the simplest way.
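A minimal sketch of that setup (it assumes the process group has already been initialized, e.g. via torch.distributed.init_process_group or a launcher such as torchrun, and uses a toy dataset in place of your own):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset; substitute your own torch.utils.data.Dataset.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

# DistributedSampler reads the world size and rank from the default process
# group and gives each process a non-overlapping shard of the indices.
sampler = DistributedSampler(dataset, shuffle=True)

# Pass the sampler instead of shuffle=True on the DataLoader itself.
loader = DataLoader(dataset, batch_size=32, sampler=sampler, num_workers=2)

for epoch in range(10):
    # Required so that each epoch uses a different shuffling order.
    sampler.set_epoch(epoch)
    for inputs, targets in loader:
        ...  # forward/backward/step as usual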

Related

Gensim LDA: coherence values not reproducible between runs

I used this code, https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/, to find the topic coherence for a dataset. When I ran this code with the same number of topics, I got new values after each run. For example, for number of topics = 10, I got the following values across two runs:
First run, number of topics = 10:
Coherence Score CV_1: 0.31230269562327095
Coherence Score UMASS_1: -3.3065236823786064
Second run, number of topics = 10:
Coherence Score CV_2: 0.277016662550274
Coherence Score UMASS_2: -3.6146150653617743
What is the reason? Given this instability, how can we trust this library? The highest coherence value changed between runs as well.
TL;DR: coherence is not "stable" (i.e. reproducible between runs) in this case because of fundamental properties of LDA. You can make LDA reproducible by setting random seeds and PYTHONHASHSEED=0, and you can take other steps to improve your results.
Long Version:
This is not a bug, it's a feature.
It is less a question of trust in the library, but an understanding of the methods involved. The scikit-learn library also has an LDA implementation, and theirs will also give you different results on each run. But by its very nature, LDA is a generative probabilistic method. Simplifying a little bit here, each time you use it, many Dirichlet distributions are generated, followed by inference steps. These steps and distribution generation depend on random number generators. Random number generators, by their definition, generate random stuff, so each model is slightly different. So calculating the coherence of these models will give you different results every time.
But that doesn't mean the library is worthless. It is a very powerful library that is used by many companies (Amazon and Cisco, for example) and academics (NIH, countless researchers) - to quote from gensim's About page:
By now, Gensim is—to my knowledge—the most robust, efficient and hassle-free piece of software to realize unsupervised semantic modelling from plain text.
If that is what you want, gensim is the way to go - certainly not the only way to go (tmtoolkit or sklearn also have LDA), but a pretty good choice of paths. That being said, there are ways to ensure reproducibility between model runs.
Gensim Reproducibility
Set PYTHONHASHSEED=0
From the Python documentation: "On Python 3.3 and greater, hash randomization is turned on by default."
Use random_state in your model specification
Afaik, all of the gensim methods have a way of specifying the random seed to be used. Choose any number you like, but set it explicitly rather than leaving the default (which leaves the seed unset), and use the same number for each rerun; this ensures that the same input into the random number generators always results in the same output (see the gensim ldamodel documentation).
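For illustration, a minimal sketch with a toy corpus (substitute your own preprocessed texts; remember that PYTHONHASHSEED=0 must be set in the environment before Python starts):

from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Toy corpus purely for illustration; use your own preprocessed texts.
texts = [["human", "computer", "interface"],
         ["survey", "user", "computer", "system"],
         ["graph", "trees", "minors", "survey"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Fixing random_state makes repeated runs produce the same model.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=42)

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print("Coherence (c_v):", cm.get_coherence())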
Use ldamodel.save() and ldamodel.load() for model persistency
This is also a very useful, timesaving step that keeps you from having to re-run your models every time you start (very important for long-running models).
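Continuing the sketch above, persisting and reloading the trained model looks roughly like this (the file name is arbitrary):

from gensim.models import LdaModel

# After training once, persist the model to disk ...
lda.save("lda_10topics.model")

# ... and in later sessions reload it instead of retraining.
lda = LdaModel.load("lda_10topics.model")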
Optimize your models and data
This doesn't technically make your models perfectly reproducible, but even without the random seed settings, you will see your model perform better (at the cost of computation time) if you increase iterations or passes. Preprocessing also makes a big difference and is an art unto itself: do you choose to lemmatize or stem, and why? This all can have important effects on the outputs and your interpretations.
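Continuing the earlier sketch (corpus and dictionary as above), passes and iterations are ordinary LdaModel arguments; the values here are only illustrative:

# More passes over the corpus and more iterations per document generally
# give a better-converged model, at the cost of training time.
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,
    passes=20,       # full sweeps over the whole corpus
    iterations=400,  # max iterations per document during inference
    random_state=42,
)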
Caveat: you must use one core only
Multicore methods (LdaMulticore and the distributed versions) are never 100% reproducible, because of the way the operating system handles multiprocessing.

Attribute selection in h2o

I am a complete beginner with h2o, and I want to know whether the h2o framework has any attribute (feature) selection capabilities that can be applied to H2OFrames.
No, there are not currently any feature selection functions in H2O. My advice would be to use Lasso regression (in H2O this means using GLM with alpha = 1.0) to do the feature selection, or to simply let whatever machine learning algorithm you are planning to use (e.g. GBM) use all the features: it will tend to ignore the bad ones, although having bad features in the training data can still degrade the performance of the algorithm.
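A hedged sketch of that Lasso approach with the H2O Python client (the file path and the response column "y" are placeholders):

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()

# Placeholder data import; "y" is assumed to be the response column.
train = h2o.import_file("your_training_data.csv")
predictors = [c for c in train.columns if c != "y"]

# alpha=1.0 gives pure Lasso (L1) regularization; lambda_search tries a
# sequence of penalty strengths and keeps the best one.
glm = H2OGeneralizedLinearEstimator(alpha=1.0, lambda_search=True)
glm.train(x=predictors, y="y", training_frame=train)

# Features whose coefficients were shrunk to exactly zero can be dropped.
coefs = glm.coef()
selected = [name for name, value in coefs.items()
            if name != "Intercept" and value != 0]
print(selected)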
If you'd like, you can make a feature request by filling out a ticket on the H2O-3 JIRA. This seems like a nice feature to have.
In my opinion, yes.
My approach is to use AutoML to train on your data.
After training, you get a lot of models.
Use the h2o.get_model method or the H2O server page to inspect the models you like.
Each of them exposes a VARIABLE IMPORTANCES table.
Then pick your features from that.
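Roughly, that workflow might look like the sketch below (the file path and response column "y" are placeholders; StackedEnsemble models are excluded here because they do not expose variable importances):

import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("your_training_data.csv")  # placeholder path
predictors = [c for c in train.columns if c != "y"]

aml = H2OAutoML(max_models=10, seed=1, exclude_algos=["StackedEnsemble"])
aml.train(x=predictors, y="y", training_frame=train)

# Inspect the leaderboard, pick a model, and read its variable importances.
print(aml.leaderboard)
model = h2o.get_model(aml.leaderboard[0, "model_id"])
print(model.varimp(use_pandas=True))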

What is tuning in machine learning? [closed]

I am a novice learner of machine learning and am confused by tuning.
What is the purpose of tuning in machine learning? To select the best parameters for an algorithm?
How does tuning work?
Without getting into a technical demonstration of the sort that might seem more appropriate for Stack Overflow, here are some general thoughts. Essentially, one can argue that the ultimate goal of machine learning is to build a system that can automatically construct models from data without requiring tedious and time-consuming human involvement. As you recognize, one of the difficulties is that learning algorithms (e.g. decision trees, random forests, clustering techniques, etc.) require you to set parameters before you use the models, or at least to set constraints on those parameters. How you set those parameters can depend on a whole host of factors. That said, your goal is usually to set those parameters to optimal values that enable you to complete a learning task in the best way possible. Thus, tuning an algorithm or machine learning technique can simply be thought of as the process of optimizing the parameters that impact the model so that the algorithm performs at its best (once, of course, you have defined what "best" actually is).
To make it more concrete, here are a few examples. If you take a clustering algorithm like k-means (or a classifier like k-nearest neighbors), you will note that you, as the programmer, must specify K, the number of centroids (or neighbors) used in the model. How do you do this? You tune the model. There are many ways to do that; one is to try many different values of K and examine how the inter- and intra-cluster error changes as you vary K.
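As a sketch of that idea (using scikit-learn and a synthetic dataset purely for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Try a range of K values and compare within-cluster error (inertia)
# and silhouette score to pick a reasonable number of clusters.
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))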
As another example, consider support vector machine (SVM) classification. SVM classification requires an initial learning phase in which the training data are used to adjust the classification parameters. This really refers to an initial parameter tuning phase in which you, as the programmer, try to "tune" the models in order to achieve high-quality results.
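A minimal sketch of that tuning phase, here done as a grid search over the SVM's C and gamma hyperparameters in scikit-learn (dataset and grid values are only illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Search over the regularization strength C and the RBF kernel width gamma.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Held-out accuracy:", search.score(X_test, y_test))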
Now, you might be thinking that this process can be difficult, and you are right. In fact, because of the difficulty of determining what optimal model parameters are, some researchers use complex learning algorithms before experimenting adequately with simpler alternatives with better tuned parameters.
In the abstract sense of machine learning, tuning is working with / "learning from" variable data based on some parameters which have been identified to affect system performance, as evaluated by some appropriate metric [1]. Improved performance reveals which parameter settings are more favorable (tuned) or less favorable (untuned).
Translating this into common sense, tuning is essentially selecting the best parameters for an algorithm to optimize its performance given a working environment such as hardware, specific workloads, etc. And tuning in machine learning is an automated process for doing this.
For example, there is no such thing as a "perfect set" of optimizations for all deployments of an Apache web server. A sysadmin learns from the data "on the job" so to speak and optimizes their own Apache web server configuration as appropriate for its specific environment. Now imagine an automated process for doing this same thing, i.e., a system that can learn from data on its own, which is the definition of machine learning. A system that tunes its own parameters in such a data-based fashion would be an instance of tuning in machine learning.
[1] System performance, as mentioned here, can be many things, and is much more general than the computers themselves. Performance can be measured by minimizing the number of adjustments needed for a self-driving car to parallel park, or the number of false predictions in autocomplete; or it could be maximizing the time an average visitor spends on a website based on advertisement dimensions, or the number of in-app purchases in Candy Crush.
Cleverly defining what "performance" means in a way that is both meaningful and measurable is key in a successful machine learning system.
A little pedantic, but just to clarify: a parameter is something that is internal to the model (you do not set it). What you are referring to is a hyperparameter.
Different machine learning algorithms have a set of hyperparameters that can be adjusted to improve performance (or make it worse). The most common and maybe simplest way to find the best hyperparameter is through what's known as a grid search (searching across a set of values).
Some examples of hyperparameters include the number of trees for a random forest algorithm, or a value for regularization.
Important note: hyperparameters must be tuned on data that is kept separate from the test set (for example, a validation split of the training data). Lots of people new to machine learning will modify the hyperparameters until they see the best performance on the test dataset; by doing this you are essentially overfitting the hyperparameters to the test set.
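One way to respect that separation, as a sketch: tune with cross-validation on the training split only, and touch the test split exactly once at the end (dataset and grid values below are only illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hyperparameters are chosen by cross-validation on the training data only.
grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
search.fit(X_train, y_train)

# The test set is used once, purely to estimate generalization performance.
print(search.best_params_, search.score(X_test, y_test))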

Cluster analysis in a Hadoop / MapReduce environment

We are currently trying to create some very basic personas based on our user database (a few million profiles). The goal at this stage is to find out what the characteristics of our users are, for example what they look like and what they are looking for, and to create several "typical" user profiles.
I believe the best way to achieve this would be to run a cluster analysis in order to find similarities among users.
The big roadblock however is how to get there. We are tracking our data in a Hadoop environment and I am being told that this could be potentially achieved with our tools.
I have familiarised myself with the theory of the topic and know that it can be done for example in SPSS (quite hard to use and limited to samples of large data sets).
The big question: is it possible to perform one or several types of cluster analysis in a Hadoop environment and then visualise the results as in SPSS? It is my understanding that we would need to run several types of analysis in order to find the best way to cluster the data, also when it comes to the distance measures used between clusters.
I have not found any information on the internet with regards to this, so I wonder if this is possible at all, without a major programming effort (meaning literally implementing for example all the standard tools available in SPSS: Dendrograms, the different result tables and cluster graphs etc.).
Any input would be much appreciated. Thanks.

Where does MapReduce/Hadoop come in for machine learning training?

MapReduce/Hadoop is great for gathering insights from piles of data from various sources and organizing them the way we want.
But when it comes to training, my impression is that we have to feed all the training data into the algorithm (be it SVM, logistic regression, or random forest) at once so that the algorithm can come up with a model that covers it all. Can MapReduce/Hadoop help in the training part? If yes, how, in general?
Yes. There are many MapReduce implementations such as Hadoop Streaming, and even some easy-to-use tools like Pig, which can be used for learning. In addition, there are distributed learning toolkits built on top of Map/Reduce, such as vowpal wabbit (https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial). The big idea behind this kind of method is to train on small portions of the data (split by HDFS), then average the submodels and communicate between the nodes, so the final model is built directly from submodels trained on parts of the data.
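As a toy, single-machine illustration of that "train submodels on shards, then average them" idea (real systems such as vowpal wabbit handle the data splitting and node communication for you; the data and model choice here are only for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for data that would live on HDFS, split into shards.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
shards = np.array_split(np.arange(len(X)), 4)

# "Map" step: each worker trains an independent submodel on its shard.
submodels = [SGDClassifier(random_state=0).fit(X[idx], y[idx]) for idx in shards]

# "Reduce" step: combine the submodels by averaging their learned weights.
averaged = submodels[0]
averaged.coef_ = np.mean([m.coef_ for m in submodels], axis=0)
averaged.intercept_ = np.mean([m.intercept_ for m in submodels], axis=0)

print("Accuracy of the averaged model:", averaged.score(X, y))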
