Reusable Data Mining Code? - algorithm

I was assigned an extra credit project where I'm given a set of data and told to run data mining algorithms on them. (Supposed to choose two of Apriori, PART, RIPPER, and J48).
We're told to reuse code for the algorithm implementations in this project, but the only code I can find is example code that can't easily be run on the data set given to us. Does anybody know of pre-made Java data mining algorithms that I can use on my data?

I would start by looking at http://www.cs.waikato.ac.nz/ml/weka/. Web searches suggest that it supports the first two options I checked, Apriori and J48. It comes with quite reasonable Java source code, but also a GUI. You may be able to do what you need just by writing code to produce your data in a format Weka can read and then working within the Weka GUI.
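If you go that route, the main thing you need is your data in ARFF, the plain-text format Weka reads natively. As a rough sketch (the file names are placeholders, and treating every column as nominal is my simplifying assumption, not part of the assignment), a small Python script can convert a CSV into ARFF:

import csv

# Hypothetical input: a CSV with a header row; every column is treated as nominal.
def csv_to_arff(csv_path, arff_path, relation="mydata"):
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    with open(arff_path, "w") as out:
        out.write("@relation " + relation + "\n\n")
        for i, name in enumerate(header):
            values = sorted({r[i] for r in data})
            out.write("@attribute " + name + " {" + ",".join(values) + "}\n")
        out.write("\n@data\n")
        for r in data:
            out.write(",".join(r) + "\n")

csv_to_arff("transactions.csv", "transactions.arff")

The resulting .arff file can be opened directly in the Weka Explorer, where Apriori lives under the Associate tab and J48 under the Classify tab.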

Related

How to quickly prepare rasa training data

I am going to build a chat bot from scratch with Rasa. The biggest difficulty right now is how to automate the production of training data. The training data consists of nlu.md and stories.md.
I have tried rasa-nlu-trainer and Chatito, but there are still a lot of manual steps, and there may be tens of thousands of corpus entries in the future. How do I annotate the data so that it matches the format nlu.md and stories.md expect?
Is there an automated tool or program to do this? Thanks a lot!
Well, if you're doing anything ML-related, your data is the most important thing the model has to learn from. Because we want the model to learn from that data, we create the data first and then train the model on it. What you're asking for is something that would somehow create that data for you. It's precisely because nothing like that exists that we build training datasets ourselves, so the model can learn from them. If you could automate the data creation process, what would you expect the model to learn?
So, you can't create the data automatically; if that were possible, we would already have Artificial General Intelligence (AGI) by now.
But if your goal is just to format existing data, then you can simply write a script for that.
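For example, if your raw corpus already sits in a CSV of intent/text pairs, a few lines of Python can emit the Markdown layout nlu.md expects (a sketch only; the file and column names here are made-up placeholders):

import csv
from collections import defaultdict

# Hypothetical input: utterances.csv with columns "intent" and "text".
examples = defaultdict(list)
with open("utterances.csv", newline="") as f:
    for row in csv.DictReader(f):
        examples[row["intent"]].append(row["text"])

# Write one "## intent:<name>" block per intent, one "- example" line per utterance.
with open("nlu.md", "w") as out:
    for intent, texts in examples.items():
        out.write("## intent:" + intent + "\n")
        for text in texts:
            out.write("- " + text + "\n")
        out.write("\n")

The labeling itself, deciding which intent each utterance belongs to, is exactly the part that can't be automated away, for the reasons above.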

Returnn Switchboard data processing

Could anybody give me pointers on how to process the Switchboard dataset for training with RETURNN? I did see the BlissDataset class, which seems to be designed for Switchboard, but it's not clear to me what I should include in the paths given in the example:
Example:
./tools/dump-dataset.py "
{'class':'BlissDataset',
'path': '/u/tuske/work/ASR/switchboard/corpus/xml/train.corpus.gz',
'bpe_file': '/u/zeyer/setups/switchboard/subwords/swb-bpe-codes',
'vocab_file': '/u/zeyer/setups/switchboard/subwords/swb-vocab'}"
The Switchboard dataset has several folders with audio files, i.e. swb1_d2/data/*.sph, and transcripts in swb1_LDC97S62/swb_ms98_transcriptions/**/*
I'm not quite sure how to proceed with this to get a dataset that can be used to train RETURNN.
At our group (RWTH Aachen University), we use the config as it was published on GitHub. As you see, this one uses ExternSprintDataset. That dataset uses Sprint (publicly called RWTH ASR (RASR), see here) as an external tool (run in a subprocess) to handle the data (feature extraction, etc.). Sprint gets a Bliss XML file which describes all the segments, with paths to the audio, audio offsets, and transcriptions, and it also gets further configs for the feature extraction and maybe other things. There is an open-source version of RASR which should work, but it might be a bit involved to get it running.
The BlissDataset was planned to be a simpler replacement for that. However, the implementation is incomplete. Also, you would still need to generate the Bliss XML yourself in some way (we used some internal scripts of our own to prepare it based on the official LDC data).
So, unfortunately, there is no simple way yet. Actually, I think the easiest way would be to come up with yet another custom format, which might be similar to the LibriSpeechDataset implementation, or maybe even the same, and then you could just reuse LibriSpeechDataset, or at least parts of it. That dataset implementation takes the data in a zip format which contains the transcripts in txt files and the audio in ogg or wav files. It uses librosa to do MFCC feature extraction (or other feature types as well). I planned to implement that for Switchboard and then reproduce the results, but I have not had time yet and am not sure when I will get to it. But if you want to try that on your own, I will be happy to help you however I can. The starting point would be to look at LibriSpeechDataset and understand what its format looks like.
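To give an idea of the feature-extraction side mentioned above, MFCC extraction with librosa for a single file looks roughly like this (just a sketch; the file name and parameter values are arbitrary, and the real LibriSpeechDataset code may differ in the details):

import librosa

# Load the audio; sr=None keeps the file's original sample rate instead of resampling.
audio, sample_rate = librosa.load("some_utterance.wav", sr=None)

# MFCC features come back with shape (n_mfcc, num_frames) ...
mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)

# ... so transpose to (time, feature), the layout dense-feature datasets generally expect.
features = mfcc.T
print(features.shape)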

Is there a way to do collaborative learning with h2o.ai (flow)?

Relatively new to ML and H2O. Is there a way to do collaborative learning/training with H2O? I would prefer a way that uses the Flow UI; otherwise I would be using Python.
My use case is that new feature samples x=[a, b, c, d] would periodically come into a system where an H2O algorithm (say, running from a Java program using a MOJO) assigns a binary class, which users should be able to manually reclassify as either good (0) or bad (1); at that point these samples (with their newly assigned responses) get sent back to the H2O algorithm to be used to further train it.
Thanks
The Flow UI is great for prototyping something very quickly with H2O without writing a single line of code. You can ingest the data, build the desired model, and then evaluate the results. Unfortunately, the Flow UI cannot be extended to do what you're asking; it is limited in that respect.
For collaborative learning you can write your whole application directly in python or R and it will work as expected.
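If you drop down to the Python API, the "collaborative" part can be approximated by periodically appending the user-corrected samples to your labeled history and retraining, then exporting a fresh MOJO for the Java side. A rough sketch (the file names, column names, and the choice of GBM are placeholder assumptions, not a prescribed setup):

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Hypothetical file of all samples labeled so far: columns a, b, c, d and a 0/1 "label".
train = h2o.import_file("labeled_history.csv")
train["label"] = train["label"].asfactor()

model = H2OGradientBoostingEstimator()
model.train(x=["a", "b", "c", "d"], y="label", training_frame=train)

# Later: users have reclassified a new batch of predictions; append it and retrain.
new_batch = h2o.import_file("reclassified_batch.csv")
new_batch["label"] = new_batch["label"].asfactor()
train = train.rbind(new_batch)

model = H2OGradientBoostingEstimator()
model.train(x=["a", "b", "c", "d"], y="label", training_frame=train)

# Export an updated MOJO for the Java program to score with.
model.download_mojo(path=".")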

Cross Validation--Use testing set or validation set to predict?

I have a question about cross validation.
In machine learning, we know there are training, validation, and test sets.
And the test set is the final run to see how the final model/classifier performs.
But in the process of cross validation:
we split the data into a training set and a testing set (most tutorials use this term), so I'm confused. Do we need to split the whole data into 3 parts: training, validation, and test? In cross validation we only ever talk about the relationship between 2 sets: training and the other.
Could someone help clarify?
Thanks
Yep, it's a little confusing, as some material uses CV/test interchangeably and some material does not, but I'll try to make it easy to understand by explaining why each set is needed:
You need the train set to do exactly that: train. But then you also need a way to ensure that your algorithm isn't memorizing the train set (that it's not overfitting) and to see how well it's doing; that creates the need for a test set, so you can give the model data it has never seen and measure its performance.
But... ML is all about experimentation. You will train, evaluate, tweak some knob (hyperparameters or architectures), train again, evaluate again, over and over, and then select the best experiment. You deploy your system, and in production it gets data it has never seen and doesn't perform that well. What happened? You used your test data to fit parameters and make decisions, so you overfitted to that test data, and you no longer know how the model does on truly unseen data.
Cross validation solves this. You have your train data to learn parameters and test data to evaluate performance on unseen data, but you still need a way to experiment with hyperparameters and architectures: you take a sample of your training data and call it the cross validation set, and you hide your test data; you will NEVER use it until the end.
Now use your train data to learn parameters, and experiment with hyperparameters and architectures, but evaluate each experiment on the cross validation data instead of the test data (you can see this as using the CV data to learn the hyperparameters). After you have experimented a lot and selected your best-performing option (on CV), you use your test data to evaluate how it performs on data it has never seen, before deploying it to production.
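In scikit-learn terms the workflow looks roughly like this (the dataset, model, and hyperparameter grid are only placeholders):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Hold out the test set and do not touch it until the very end.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Cross-validation happens inside the training data only:
# each hyperparameter setting is scored on held-out folds of the train set.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)
print("best CV score:", search.best_score_)

# One final, honest estimate on data the model has never influenced.
print("test score:", search.score(X_test, y_test))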
This is generally an either-or choice. The process of cross-validation is, by design, another way to validate the model. You don't need a separate validation set -- the interactions of the various train-test partitions replace the need for a validation set.
Think about the name, cross-validation ... :-)

Where is Pentaho Kettle's architecture?

Where can I find Pentaho Kettle architecture? I'm looking for a short wiki, design document, blog post, anything to give a good overview on how things work. This question is not meant for specific "how to" starting guides but rather a good view at the technology and architecture.
Specific questions I have are:
How does data flow between steps? It would seem everything is in memory - am I right about this?
Is the above true about different transformations as well?
How are the Collect steps implemented?
Any specific performance guidelines for using it?
Is the ftp task reliable and performant?
Any other "Dos and Don'ts" ?
See this PDF.
How does data flow between steps? It would seem everything is in memory - am I right about this?
Data flow is row-based. Within a transformation, every step produces a 'tuple', i.e. a row with fields. Every field is a pair of data and metadata. Every step has inputs and outputs: a step takes rows from its input, modifies them, and sends them to its outputs. In most cases all the information is in memory. But: steps read data in a streaming fashion (e.g. via JDBC), so typically only part of the data from a stream is in memory at any one time.
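To picture why streaming keeps memory use bounded while a collect-style step does not, here is a small Python analogy (this is not Kettle code, just an illustration of the two behaviours; the input file is hypothetical):

def read_rows(path):
    # Streaming source: yields one row at a time, never holds the whole file.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n").split(",")

def add_field(rows):
    # Streaming transform: each row is modified and passed on immediately.
    for row in rows:
        yield row + ["extra"]

def sort_rows(rows):
    # Collect-style step: has to pull every row into memory before it can emit anything.
    return sorted(rows)

# A purely streaming pipeline keeps only the current row in memory:
for row in add_field(read_rows("input.csv")):
    pass  # hand the row to the next step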
Is the above true about different transformations as well?
There is a 'job' concept and a 'transformation' concept. Everything written above is mostly true for transformations. Mostly, because a transformation can contain very different steps, and some of them, like the collect steps, may try to collect all the data from a stream. Jobs are a way to perform actions that do not follow the 'streaming' concept, such as sending an email on success, loading some files from the net, or executing different transformations one by one.
How are the Collect steps implemented?
It depends on the particular step. Typically, as said above, collect steps may try to collect all the data from a stream, and because of that they can be a cause of OutOfMemory exceptions. If the data is too big, consider replacing the 'collect' steps with a different approach to processing the data (for example, use steps that do not collect all the data).
Any specific performance guidelines for using it?
A lot of them. It depends on the steps the transformation consists of and the data sources used. I would rather discuss an exact scenario than give general guidelines.
Is the ftp task reliable and performant?
As far as I remember, FTP is backed by the EdtFTP implementation, and there may be some issues with those steps, such as some parameters not being saved, or an HTTP-FTP proxy not working. I would say Kettle in general is reliable and performant, but for some less commonly used scenarios it can be less so.
Any other "Dos and Don'ts" ?
I would say the main "Do" is to understand the tool before starting to use it intensively. As mentioned in this discussion, there is a fair amount of literature on Kettle/Pentaho Data Integration; you can search for it on specific sites.
One of the advantages of Pentaho Data Integration/Kettle is its relatively big community, which you can ask about specific aspects.
http://forums.pentaho.com/
https://help.pentaho.com/Documentation
