How can I implement LDA using Apache Mahout? - hadoop

I have a data set like the one below, in CSV format.
FileName,Topic,Tag,Frequency
File-1,Topic -1,Tag-1,10
File-2,Topic -2,Tag-2,10
File-3,Topic -3,Tag-2,10
File-4,Topic -4,Tag-4,10
File-5,Topic -1,Tag-5,10
File-6,Topic -3,Tag-1,10
File-7,Topic -1,Tag-1,10
I need to find correlations between the tags using Mahout's LDA (Latent Dirichlet Allocation) algorithm. Can anybody help me work out how to do that with Apache Mahout?
I am also unsure exactly what input format Mahout expects.
It would be helpful if somebody could share some good material for Mahout beginners.

I might be late in answering, but Mahout no longer supports LDA in versions above 0.6. You have to use CVB instead of LDA to run topic models.
The following links can help you:
https://mahout.apache.org/users/clustering/lda-commandline.html
https://mahout.apache.org/users/clustering/latent-dirichlet-allocation.html
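
On the input-format question: Mahout's text tooling is usually fed a directory of plain-text documents, which seqdirectory turns into SequenceFiles and seq2sparse turns into term vectors. A minimal sketch of one possible preparation step follows, written in Ruby only for brevity. It assumes each FileName is treated as a document and each Tag as a word repeated Frequency times; that mapping is an assumption, not something stated in the answer above, and the names SAMPLE_CSV and rows_to_documents are hypothetical.

```ruby
require "csv"

# Sample rows copied from the question.
SAMPLE_CSV = <<~CSV
  FileName,Topic,Tag,Frequency
  File-1,Topic -1,Tag-1,10
  File-2,Topic -2,Tag-2,10
  File-3,Topic -3,Tag-2,10
  File-4,Topic -4,Tag-4,10
  File-5,Topic -1,Tag-5,10
  File-6,Topic -3,Tag-1,10
  File-7,Topic -1,Tag-1,10
CSV

# Hypothetical helper: builds one text "document" per FileName by
# repeating each tag as many times as its Frequency column says.
def rows_to_documents(csv_text)
  docs = Hash.new { |hash, name| hash[name] = [] }
  CSV.parse(csv_text, headers: true).each do |row|
    docs[row["FileName"]].concat([row["Tag"]] * row["Frequency"].to_i)
  end
  docs.transform_values { |tokens| tokens.join(" ") }
end

documents = rows_to_documents(SAMPLE_CSV)
puts documents["File-1"]
```

Each value could then be written to its own text file in a directory for the Mahout tools to read. In the sample data every file carries a single tag, so grouping by Topic instead of FileName may give a topic model more to work with; that is a modelling choice rather than a Mahout requirement.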

Related

What is the algorithm behind Rasa NLU?

I see that Rasa NLU uses MITIE and spaCy, but can anyone explain how it uses them and the algorithm behind it?
There is a post by Alan on the Rasa blog that covers the basic approach used:
https://medium.com/rasa-blog/do-it-yourself-nlp-for-bot-developers-2e2da2817f3d
This should give a good idea of roughly what it's doing, but if you are keen to find out more, you can easily look over the actual code used (which is the great advantage of open source solutions!): https://github.com/RasaHQ/rasa_nlu/tree/master/rasa_nlu
It depends on what kind of NER you want to use for your bot. Basically, you define a pipeline in your configuration file. spaCy is the most commonly preferred option, since its corpus is updated regularly and it is widely used. MITIE is not as good as spaCy and is also older.
language: "en"
pipeline: "spacy_sklearn"
You can read more details here:
Choosing a Rasa NLU pipeline

What is a good link to examples of enaml being used with traits and matplotlib?

I have done GUI construction, but not in Python. From other Stack Exchange questions and my own investigation, it looks like I want to use enaml and traits for the bulk of this work. Are there any links or references to help me get started?
This is a scientific application integrating matplotlib plots, text boxes, and buttons (very simple, I think). I have gone through this example but don't understand it too well: http://code.enthought.com/projects/traits/docs/html/tutorials/traits_ui_scientific_app.html
I have also gone through the Enthought Chaco examples and didn't get very far. Has somebody built a program that I could run and whose code I could look at? Or is there a repository of examples I am not aware of? I found the enaml examples, but the one with matplotlib is basic and does not show me how to connect my algorithms to the plots. Thanks in advance!
Not a full answer, but for additional context:
1) Use https://github.com/nucleic/enaml, along with https://github.com/enthought/traits-enaml
2) Example:
https://github.com/nucleic/enaml/blob/master/examples/widgets/mpl_canvas.enaml

Reading and writing to hadoop sequence file using scala

I just started using Scalding and am trying to find examples of reading a text file and writing to a Hadoop sequence file.
Any help is appreciated.
You can use com.twitter.scalding.WritableSequenceFile (please note that you have to use the fully qualified name, otherwise it picks up the Cascading one). Hope this helps.

Where could I find an implementation of SVM on Hadoop?

I found an implementation at http://code.google.com/p/cascadesvm/.
However, there is no documentation for it. Has anyone tried it? Or where could I find an alternative implementation of SVM on Hadoop?
Thanks a lot~
It looks like someone did this within the Mahout project. I'm not sure whether it has been merged into trunk, but this looks like a good place to start:
https://issues.apache.org/jira/browse/MAHOUT-232
You can check it out at https://code.google.com/p/cascadesvm/
The training part and a demo have been released in a MATLAB version.
https://code.google.com/p/cascadesvm/wiki/CascadeSVMMatlabVersion

Pulling stats out of a text

I'd like to know which words are the most recurrent in a given text or group of texts (pulled from a database) in Ruby.
Does anyone know what the best practices are?
You might start with statistical natural language processing. Also, you may be able to leverage one or more of the libraries mentioned on the AI Ruby Plugins page.
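
As a starting point before reaching for a library, here is a minimal sketch of plain frequency counting in Ruby. It assumes "most recurrent" refers to words, and that a word is a run of letters and apostrophes; the method name top_words and the sample strings are hypothetical, and real text would likely also need stop-word filtering or stemming.

```ruby
# Hypothetical helper: counts how often each word occurs across all
# texts and returns the n most frequent as [word, count] pairs.
def top_words(texts, n = 10)
  counts = Hash.new(0)
  texts.each do |text|
    # Lowercase first so "The" and "the" are counted together;
    # a word here is any run of letters and apostrophes.
    text.downcase.scan(/[[:alpha:]']+/).each { |word| counts[word] += 1 }
  end
  # Highest count first; ties are broken alphabetically.
  counts.sort_by { |word, count| [-count, word] }.first(n)
end

# In practice the strings would come from the database rows.
sample_texts = ["the cat and the hat", "The cat sat"]
top_words(sample_texts, 2).each { |word, count| puts "#{word}: #{count}" }
```

Passing an array of strings keeps the counting independent of where the text comes from, so the same method works for a single text or a group of them.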

Resources