Parallel Text Preprocessing with Pyro - gensim

I am using distributed Gensim for topic modeling of documents. However, the preprocessing part is not distributed. Therefore I would like to implement distributed preprocessing of the text data with Pyro (since distributed Gensim is also based on Pyro). Are there any examples of how to use Pyro for this task?


LightGBM: Intent of lightgbm.Dataset()

What is the purpose of lightgbm.Dataset() as per the docs when I can use the sklearn API to feed the data and train a model?
Any real-world examples explaining the usage of lightgbm.Dataset() would be interesting to learn from.
LightGBM uses a few techniques to speed up training which require preprocessing one time before training starts.
The most important of these is bucketing continuous features into histograms. When LightGBM searches splits to possibly add to a tree, it only searches the boundaries of these histogram bins. This greatly reduces the number of splits to evaluate.
I think this picture from "What Makes LightGBM Fast?" describes it well:
The Dataset object in the library is where this preprocessing happens. Histograms are created one time, and then don't need to be calculated again for the rest of training.
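To make the histogram idea concrete, here is a minimal pure-Python sketch. It is not LightGBM's actual implementation: the equal-frequency binning, the squared-sum gain, and the helper names `make_bin_edges` and `best_split` are all assumptions chosen for illustration. It only shows that, once a feature is bucketed, split search looks at bin boundaries instead of every distinct value.

```python
import bisect


def make_bin_edges(values, max_bin):
    """Pick up to max_bin - 1 thresholds at equal-frequency quantiles."""
    ordered = sorted(values)
    edges = []
    for b in range(1, max_bin):
        edge = ordered[b * len(ordered) // max_bin]
        if not edges or edge > edges[-1]:  # skip duplicate thresholds
            edges.append(edge)
    return edges


def best_split(values, targets, max_bin=4):
    """Return (threshold, number of candidate splits) for one feature."""
    edges = make_bin_edges(values, max_bin)

    # One pass over the data builds the histogram: a count and a target sum per bin.
    counts = [0] * (len(edges) + 1)
    sums = [0.0] * (len(edges) + 1)
    for value, target in zip(values, targets):
        b = bisect.bisect_right(edges, value)
        counts[b] += 1
        sums[b] += target

    # Split search only walks the bin boundaries, never the raw values.
    total_n, total_sum = sum(counts), sum(sums)
    best_gain, best_threshold = float("-inf"), None
    left_n, left_sum = 0, 0.0
    for b, edge in enumerate(edges):
        left_n += counts[b]
        left_sum += sums[b]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        # Variance-reduction style gain for a squared-error objective.
        gain = left_sum ** 2 / left_n + (total_sum - left_sum) ** 2 / right_n
        if gain > best_gain:
            best_gain, best_threshold = gain, edge
    return best_threshold, len(edges)


values = [1, 2, 3, 4, 10, 11, 12, 13]
targets = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, n_candidates = best_split(values, targets, max_bin=4)
print(threshold, n_candidates)  # rows with value < threshold go to the left child
```

With 8 distinct values an exact search would consider 7 split points; with 4 bins only 3 boundaries are evaluated, and the histogram itself never has to be rebuilt.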
You can get some more information about what happens in the Dataset object by looking at the parameters that control that Dataset. Some examples of other tasks:
optimization for sparse features
filtering out features that are not splittable
"when I can use the sklearn API to feed the data and train a model"
The lightgbm.sklearn interface is intended to make it easy to use LightGBM alongside other libraries like xgboost and scikit-learn. It takes in data in formats like scipy sparse matrices, pandas data frames, and numpy arrays to be compatible with those other libraries. Internally, LightGBM constructs a Dataset from those inputs.

Recommender system using pig or mahout

I am building a simple recommender system on Hadoop. Can you give me an opinion on what to use to build it?
I would like to use Apache Pig or Apache Mahout.
In my data set i am having
I have my data in CSV format.
So can you please suggest which technology to use to produce an item-based and user-based recommender system?
Apache Mahout will provide you with an off-the-shelf recommendation engine based on collaborative filtering algorithms.
With Pig you will have to implement those algorithms yourself - in Pig Latin, which may be a rather complex task.
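To give a feel for what "implementing it yourself" involves, here is a minimal single-machine sketch of item-based collaborative filtering in Python. It is neither Mahout's nor Pig's code: the CSV columns (`user`, `item`, `rating`), the sample data, and the function names are assumptions for illustration. In Pig Latin the same logic would have to be expressed as joins and group-bys over the ratings relation.

```python
import csv
import io
import math
from collections import defaultdict

# Hypothetical ratings file; the column names are an assumption.
RATINGS_CSV = """user,item,rating
alice,book_a,5
alice,book_b,4
bob,book_a,4
bob,book_b,5
bob,book_c,2
carol,book_b,4
carol,book_c,5
"""


def load_ratings(text):
    """Return {item: {user: rating}} parsed from CSV text."""
    by_item = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        by_item[row["item"]][row["user"]] = float(row["rating"])
    return by_item


def cosine(a, b):
    """Cosine similarity between two items' {user: rating} vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(by_item, user, top_n=3):
    """Rank unseen items by a similarity-weighted average of the user's ratings."""
    rated = {item: r[user] for item, r in by_item.items() if user in r}
    scores = {}
    for candidate, candidate_ratings in by_item.items():
        if candidate in rated:
            continue
        sims = [(cosine(candidate_ratings, by_item[item]), rating)
                for item, rating in rated.items()]
        weight = sum(sim for sim, _ in sims)
        if weight > 0:
            scores[candidate] = sum(sim * rating for sim, rating in sims) / weight
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


ratings = load_ratings(RATINGS_CSV)
print(recommend(ratings, "alice"))
```

A user-based variant would swap the roles of users and items: compare users by the items they rated in common, then score items from the most similar users.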
I know it's not one of your preferred methods, but another product you can use on Hadoop to create a recommendation engine is Oryx.
Oryx was created by Sean Owen (co-author of the book Mahout in Action, and a major contributor to the Mahout code base). It only has 3 algorithms at the moment (Alternating Least Squares, K-Means Clustering, and Random Decision Forests), but the ALS algorithm provides a fairly easy to use Collaborative Filtering engine, sitting on top of the Hadoop infrastructure.
From the brief description of your dataset, it sounds like it would work perfectly. It has a model generation engine (computational layer), and it can generate a new model based on one of 3 criteria:
1) Age (time between model generations)
2) Number of records added
3) Amount of data added
Once a generation of data has been built, there's another java daemon that runs (the service layer), which will serve out the recommendations (user to item, item to item, blind recommendations, etc) via a RESTful API. When a new generation of the model is created, it will automatically pick up that generation and serve it out.
There are some nice features in the model generation as well, such as aging historic data, which can help get around issues like seasonality (probably not a big deal if you're talking about books, though).
The computational layer (model generation) uses HDFS to store/lookup data, and uses MapReduce or YARN for job control. The serving layer is a daemon that can run on each data node, and it accesses the HDFS filesystem for the computed model data to present out over the API.

Where does map-reduce/hadoop come in in machine learning training?

Map-reduce/Hadoop is perfect for gathering insights from piles of data from various sources and organizing them the way we want.
But when it comes to training, my impression is that we have to dump all the training data into the algorithm (be it SVM, logistic regression, or random forest) all at once so that the algorithm is able to come up with a model that has it all. Can map-reduce/Hadoop help in the training part? If yes, how in general?
Yes. There are many MapReduce implementations, such as Hadoop Streaming, and even some easy tools like Pig, which can be used for learning. In addition, there are distributed learning toolsets built upon Map/Reduce, such as Vowpal Wabbit. The big idea of this kind of method is to train on small portions of the data (split by HDFS), then average the resulting models and communicate the result between the nodes. So the model gets its updates directly from submodels built on parts of the data.
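The "train on shards, then average" idea can be sketched on a single machine as follows. This is only an illustration under stated assumptions: the synthetic data, the plain SGD logistic regression, the four shards, and the function names are all made up for the example, and it does not reproduce what Vowpal Wabbit or any Hadoop job actually does. The "map" step trains one model per shard and the "reduce" step averages the weights.

```python
import math
import random


def train_logistic(rows, epochs=50, lr=0.1):
    """Map step: SGD logistic regression on one shard; returns [bias, w1, ..., wn]."""
    weights = [0.0] * (len(rows[0][0]) + 1)
    for _ in range(epochs):
        for x, y in rows:
            z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
            error = 1.0 / (1.0 + math.exp(-z)) - y
            weights[0] -= lr * error
            for j, xj in enumerate(x):
                weights[j + 1] -= lr * error * xj
    return weights


def average_models(models):
    """Reduce step: element-wise average of the per-shard weight vectors."""
    return [sum(ws) / len(models) for ws in zip(*models)]


def predict(weights, x):
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if z > 0 else 0


# Synthetic, linearly separable data: label is 1 when x0 + x1 > 0.
rng = random.Random(0)
data = []
for _ in range(400):
    x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    data.append((x, 1 if x[0] + x[1] > 0 else 0))

# Pretend each slice of 100 rows lives on a different node.
shards = [data[i::4] for i in range(4)]
shard_models = list(map(train_logistic, shards))
model = average_models(shard_models)

accuracy = sum(predict(model, x) == y for x, y in data) / len(data)
print(round(accuracy, 3))
```

One-shot averaging like this works reasonably for convex models such as logistic regression; iterative schemes repeat the train-and-average cycle so that every node continues from the shared model.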

Unsupervised automatic tagging algorithms?

I want to build a web application that lets users upload documents, videos, images, music, and then give them an ability to search them. Think of it as Dropbox + Semantic Search.
When user uploads a new file, e.g. Document1.docx, how could I automatically generate tags based on the content of the file? In other words no user input is needed to determine what the file is about. If suppose that Document1.docx is a research paper on data mining, then when user searches for data mining, or research paper, or document1, that file should be returned in search results, since data mining and research paper will most likely be potential auto-generated tags for that given document.
1. Which algorithms would you recommend for this problem?
2. Is there a natural language library that could do this for me?
3. Which machine learning techniques should I look into to improve tagging precision?
4. How could I extend this to video and image automatic tagging?
Thanks in advance!
The most common unsupervised machine learning model for this type of task is Latent Dirichlet Allocation (LDA). This model automatically infers a collection of topics over a corpus of documents based on the words in those documents. Running LDA on your set of documents would assign each word a probability under each topic; when a user searches for a word, you could then retrieve the documents with the highest probability of being relevant to that word.
There have been some extensions to images and music as well.
LDA has several efficient implementations in several languages:
many implementations from the original researchers, written in Java and recommended by others on SO
PLDA: a fast, parallelized C++ implementation
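For intuition only, here is a toy collapsed Gibbs sampler for LDA in pure Python, followed by a helper that proposes tags from a document's dominant topic. It is not one of the implementations listed above and is far too slow for real corpora; the tiny corpus, the hyperparameters, and the names `lda_gibbs` and `suggest_tags` are assumptions for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; returns (doc_topic counts, topic_word counts)."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [{word: 0 for word in vocab} for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every word occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for word, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the current assignment, then resample it.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * len(vocab))
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def suggest_tags(doc_topic, topic_word, d, n=2):
    """Propose the n most frequent words of document d's dominant topic."""
    topic = max(range(len(doc_topic[d])), key=lambda k: doc_topic[d][k])
    counts = topic_word[topic]
    return sorted(counts, key=counts.get, reverse=True)[:n]


docs = [
    ["data", "mining", "cluster", "data"],
    ["mining", "data", "pattern"],
    ["guitar", "music", "chord"],
    ["music", "guitar", "song", "music"],
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
print(suggest_tags(doc_topic, topic_word, d=0))
```

Because the tags come from a topic rather than from the document alone, a file can be tagged with a word it never contains, which is what makes topic models useful for search.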
These guys propose an alternative to LDA.
Automatic Tag Recommendation Algorithms for Social Recommender Systems
Haven't read thru the whole paper but they have two algorithms:
Supervised learning version. This isn't that bad. You can use Wikipedia to train the algorithm
"Prototype" version. Haven't had a chance to go thru this but this is what they recommend
UPDATE: I've researched this some more and I've found another approach. Basically, it's a two-stage approach that's very simple to understand and implement. While too slow for 100,000s of documents, it (probably) has good performance for 1000s of docs (so it's perfect for tagging a single user's documents). I'm going to try this approach and will report back on performance/usability.
In the meantime, here's the approach:
1. Use TextRank to generate a tag list for a single document. This tag list is generated independently of other documents.
2. Use the algorithm from "Using Machine Learning to Support Continuous Ontology Development" to integrate the tag list (from step 1) into the existing tag list.
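The first of these two steps can be sketched roughly as below. This is a simplified take on TextRank, not the reference implementation: the original algorithm filters candidate words by part of speech, whereas this sketch uses a small hand-written stopword list, and the window size, damping factor, and iteration count are assumed values. The second step (merging into an existing tag list) is not covered.

```python
import re
from collections import defaultdict

# Tiny stopword list standing in for the part-of-speech filter (an assumption).
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "from", "into"}


def textrank_keywords(text, top_n=5, window=2, damping=0.85, iterations=30):
    """Rank words by PageRank over a word co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if len(w) > 2 and w not in STOPWORDS]

    # Connect each word to the words that follow it within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Synchronous PageRank updates on the undirected, unweighted graph.
    score = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        score = {
            word: (1 - damping) + damping * sum(
                score[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(score, key=score.get, reverse=True)[:top_n]


text = ("Data mining finds patterns in data. Data mining algorithms cluster "
        "data and classify data. Mining large data sets needs scalable "
        "mining algorithms.")
keywords = textrank_keywords(text, top_n=3)
print(keywords)
```

Each document is processed on its own, which is why this step scales with the length of a single document rather than with the size of the whole collection.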
Text documents can be tagged using this keyphrase extraction algorithm/package.
Currently it supports limited type of documents (Agricultural and medical I guess) but you can train it according to your requirements.
I'm not sure how the image/video part would work out, unless you're doing very accurate object detection (which has its own shortcomings). How are you planning to do it?
You want Doc-Tags, which is a commercial product that automatically, and unsupervised, generates contextually accurate document tags. The built-in reporting functionality makes the product a lightweight document management system.
For developers wanting to customize their own approach, the source code is available (very cheap) and the back-end service xAIgent is very inexpensive to use.
I posted a blog article today to answer your question.
There are basically two approaches to automatically extract keywords from images and videos.
Multiple Instance Learning (MIL)
Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and the variants
In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include demo site and source code.
Thanks, Scott

Exporting Mahout model output as Weka input

I would like to use the output model of a Mahout decision tree training process as the input model for a Weka based classifier.
As the training of a complex decision tree that is based on millions of training records is almost impractical for a single node Weka classifier, I would like to use Mahout to build the model, using for example Random Forest Partial Implementation.
While the algorithm above can be problematic while training, it is rather simple to use it for prediction with Weka on a single machine.
On the Mahout wiki site it is stated that the data formats for import include the Weka ARFF format, but not for export.
Is it possible to use some of the existing implementations in Mahout to train models that will be used in production with a simple Weka based system?
I don't think it's possible to do what you're asking: .arff is a data format, as are all of the other options in the import/export menus. The classifiers that Weka can save/load are, in fact, Weka's java Classifier objects written to a file using Java's Serializable interface. They're not so much portable trees as they are Java objects that last longer than the JVMs which create them. Thus, to do what you want, either Mahout or Weka would have to be able to produce/read each other's code, and that's not something I can find any documentation of.
My experience is that with several million training records (consisting of ~45 numeric features/columns each), Weka's Random Forest implementation using the default options is very fast (operating in seconds on a single 2.26GHz core), so it may not be necessary to bother with Mahout. Your data set may well have different results, though.
