Machine Learning/Artificial Intelligence - Classify column based on the value / pattern - algorithm

I have been trying some frameworks and algorithms, and I can't find one that does what I want, which is to classify a column of data based on its values.
I tried a Naive Bayes algorithm, but it isn't very precise, because I can't expect the exact values being searched for to appear in the training set; I can only expect the pattern to be in the training data.
I don't have a background in Machine Learning / AI, but I was looking for a working example before really going deeper into an implementation.
I built a smaller ARFF file as an example. I also tried lots of Weka's classification algorithms, but none of them gave me good results.
@relation recommend
@attribute class {name,email,taxid,phone}
@attribute text string
@data
name,'Erik Kolh'
name,'Eric Candid'
name,'Allan Pavinan'
name,'Jubaru Guttenberg'
name,'Barabara Bere'
name,'Chuck Azul'
email,'erik@gmail.com'
email,'steven@spielberg.com'
email,'dogs@cats.com'
taxid,'123611216'
taxid,'123545413'
taxid,'562321677'
taxid,'671312678'
taxid,'123123216'
phone,'438-597-7427'
phone,'478-711-7678'
phone,'321-651-5468'
My expectation is to train on a huge dataset like the one above and get predictions based on the pattern, e.g.:
joao@bing.com -> email
Joao Vitor -> name
400-123-5519 -> phone
Can you please suggest any algorithms, examples or ideas to research?
I couldn't find a good fit; maybe it's just a lack of vocabulary.
Thank you!

What you are trying to do is called named entity recognition (NER). Weka is most likely not much help here. The library Mallet (http://mallet.cs.umass.edu) might be a good fit. I would recommend a Conditional Random Field (CRF) based approach.
If you would like to stay with Weka, you need to change your feature space; then Naive Bayes will do OK on data like yours.
E.g., add features for the following (see the sketch after this list):
whether the word contains only letters
whether it is alphanumeric
whether it is purely numeric
the number of digits it contains
whether it starts with a capital letter
... (just be creative)
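A minimal sketch of this feature-space idea (in Python with scikit-learn as a stand-in for Weka; the tiny training set is taken from the question, and the feature choices are just illustrative):

# Classify a string as name/email/taxid/phone from surface-form
# features rather than from the raw value itself.
import re
import numpy as np
from sklearn.naive_bayes import GaussianNB

def features(text):
    return [
        float(all(c.isalpha() or c.isspace() for c in text)),  # letters/spaces only
        float(text.isalnum()),                                 # alphanumeric
        float(text.isdigit()),                                 # purely numeric
        float(sum(c.isdigit() for c in text)),                 # number of digits
        float(text[:1].isupper()),                             # starts capitalized
        float("@" in text),                                    # contains an at-sign
        float(bool(re.fullmatch(r"[0-9() +-]+", text))),       # digits plus separators
    ]

train = [("Erik Kolh", "name"), ("Eric Candid", "name"),
         ("erik@gmail.com", "email"), ("dogs@cats.com", "email"),
         ("123611216", "taxid"), ("123545413", "taxid"),
         ("438-597-7427", "phone"), ("478-711-7678", "phone")]
X = np.array([features(text) for text, _ in train])
y = [label for _, label in train]

clf = GaussianNB().fit(X, y)
for query in ["joao@bing.com", "Joao Vitor", "400-123-5519"]:
    print(query, "->", clf.predict(np.array([features(query)]))[0])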

Related

Contextual Search: Classifying shopping products

I have got a new (non-traditional) task from my client; it is about machine learning.
As I have never done "machine learning", apart from a little data mining, I need your help.
My task is to classify a product found on any shopping site on the basis of gender (whom the product is for), age group, etc. The training data we have is the product's title, keywords (available in the HTML of the product page), and the product description.
I did a lot of R&D: I found image recognition APIs (CloudSight, VuFind) that returned details of the product image, but that did not fulfil the need; I used Google suggest queries, looked into many machine learning algorithms, and finally...
I came to know about the decision tree learning algorithm, but I cannot figure out how it applies to my problem.
I tried out the "PlayingTennis" dataset but couldn't make sense of what to do.
Can you give me some direction on where to start this journey? Should I focus on the decision tree learning algorithm, or is there another algorithm you would suggest I focus on to categorize the products on the basis of context?
If you like, I can share in detail what I have already searched and tried to solve my problem.
I would suggest doing the following:
1. Go through the items in your dataset and classify them manually (decide for which gender each item is). Store each decision so that you can link each item in the original dataset to a target class.
2. Develop an algorithm for converting each item from your dataset into a feature vector, i.e. a vector of numbers (more about how to do this later).
3. Convert your whole dataset, with the appropriate classes, into a dataset that looks like this:
Feature_1, Feature_2, Feature_3, ..., Gender
value_1, value_2, value_3, ..., male
It would be a good decision to store it in a CSV file, since you can then load and process it in different machine learning tools (more about those later).
4. Load the dataset you created at step 3 into a machine learning tool of your choice and try to come up with the best model that can classify the items in your dataset by gender.
5. Store the model created at step 4. It will be part of your production system.
6. Develop production code that takes an unclassified product, creates a feature vector from it, and passes this feature vector to the model you saved at step 5. The result of this operation is the predicted gender.
Details
If there are too many items (say, tens of thousands) in your original dataset, it may be impractical to classify them all yourself. What you can do is use Amazon Mechanical Turk to simplify the task. If you are unable to use it (the last time I checked, you had to have a US address), you can classify a few hundred items to start working on your model and classify more later to improve the accuracy of your classification (the more training data you use, the better the accuracy, but only up to a certain point).
How to extract features from a dataset
If a keyword has a form like tag=true/false, it's a boolean feature.
If a keyword has a form like tag=42, it's a numerical or ordinal one. For example, it can be a price value or a price range (0-10, 10-50, 50-100, etc.).
If a keyword has a form like tag=string_value, you can convert it into a categorical value.
The class (gender) is simply a boolean value, 0/1.
You can experiment a bit with how you extract your features, since it may influence the resulting accuracy.
How to extract features from product description
There are different ways to convert text into a feature vector. Look up TF-IDF or similar algorithms.
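As a hedged sketch of this step (scikit-learn assumed; the products, descriptions and labels below are made up):

# TF-IDF features from product text feeding a Naive Bayes classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["floral summer dress lightweight", "classic leather wallet for men",
        "men's running shoes breathable", "women's silk scarf with pattern"]
labels = ["female", "male", "male", "female"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["leather belt for men"]))  # -> ['male'] on this toy data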
Machine learning tools
You can use one of the existing machine learning libraries and hack together some code that loads your CSV dataset, trains a model and checks the accuracy, but at first I would suggest using something like Weka. It has a more or less intuitive UI, and you can quickly start experimenting with different machine learning algorithms, converting features in your dataset from strings to categories, or from real values to ordinal values, etc. A good thing about Weka is that it has a Java API, so you can automate the whole process of data conversion, train models programmatically, etc.
What algorithms to choose
I would suggest using decision tree algorithms like C4.5. They are fast and show good results on a wide range of machine learning tasks. Additionally, you can use an ensemble of classifiers: there are various algorithms that combine several classifiers (google for boosting or random forests to find out more). They usually give better results but run more slowly, since you need to run a single feature vector through several models.
One other trick that you can use to make your classifier more accurate is to train models that work on different sets of features (say, one model uses features extracted from the tags and another uses data extracted from the product description). You can then combine them using algorithms like stacking to come up with a final result, as in the sketch below.
For classification on the basis of features extracted from text, you can try the Naive Bayes algorithm or an SVM. They both show good results in text classification.
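A rough sketch of the stacking idea (scikit-learn assumed; for brevity both base models here see the same TF-IDF features, whereas the text above suggests giving each model a different feature set):

# Stack a Naive Bayes text model and a linear SVM, combined by a
# logistic regression as the final estimator.
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["men leather wallet", "women summer dress",
        "men running shoes", "women silk scarf"]
labels = ["male", "female", "male", "female"]

stack = make_pipeline(
    TfidfVectorizer(),
    StackingClassifier(
        estimators=[("nb", MultinomialNB()), ("svm", LinearSVC())],
        final_estimator=LogisticRegression(),
        cv=2,  # tiny cv only because the toy dataset is tiny
    ),
)
stack.fit(docs, labels)
print(stack.predict(["leather shoes for men"]))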
Do consider the Support Vector Classifier (SVC), or, for Google's sake, the Support Vector Machine (SVM). If you have a large training set (which I suspect you do), search for implementations that are "fast" or "scalable".

Data matching Algorithm Approach

I don't really know where to start with this project, and so I'm hoping a broad question can at least point me in the right direction.
I have two data sets right now, each about 5 GB with 2 million observations. They are the assessed and historical data gathered for property listings of a given area over a certain period of time. What I need to do is match properties to one another: a property may appear several times in the historical data because it gets sold two or three times during the period. In the historical data I have the seller info, the loan info, and the sale info. In the assessor data I have all of the characteristics that would describe the property sold. So in order to do any pricing model, I need to match the two.
I have variables that are similar in each; however, they will differ slightly (misspellings, abbreviations, etc.). Does anyone have any recommendations about how to go through this? First off, which program would I want to do this in? I have experience in STATA, R and a little bit of SAS and Matlab, but I'd prefer to use the former two.
I read through this:
Data matching algorithm
where .NET was used, and one user suggested a Levenshtein approach (where the distance between strings is calculated), so for fields like Address I could use this and weight the approximate match between the two strings. It was then suggested to use Soundex for, say, the name of the seller/owner.
But I'm really lost on how to implement any of this, and before I approach anyone in my department I really need to have some idea of what I'm doing!
Any help or advice would be immensely helpful.
Yes, there are several good algorithms for the string matching problem you describe, namely:
Jaro-Winkler,
Smith-Waterman,
Dice-Sørensen,
Soundex,
Damerau-Levenshtein, and
Monge-Elkan,
to name a few.
I recommend "A Comparison of String Distance Metrics for Name-Matching Tasks" by W. W. Cohen, P. Ravikumar and S. Fienberg for an overview of what might work best for what.
SoftTFIDF claims to be the best one. It is available as a Java package. There are other implementations of string matching and record linkage algorithms available in:
Java (SecondString),
Python (JellyFish),
C# (FuzzyString), and
Scala (StringMetric)
libraries.
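For example, with the Python JellyFish package listed above (a quick sketch; note that recent releases renamed the functions to the *_similarity / *_distance forms used here):

# Compare slightly different spellings of the same address / name.
import jellyfish

a, b = "123 Main Street", "123 Mian St"
print(jellyfish.jaro_winkler_similarity(a, b))       # similarity in [0, 1]
print(jellyfish.damerau_levenshtein_distance(a, b))  # number of edits
print(jellyfish.soundex("Smith"), jellyfish.soundex("Smyth"))  # both S530

A simple matcher can then accept a candidate pair of records when, say, the Jaro-Winkler similarity of the addresses exceeds a tuned threshold and the Soundex codes of the owner names agree.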

Natural Language Processing for Smart Homes

I'm writing Smart Home software for my bachelor's degree; it will only simulate the actual house, but I'm stuck on the NLP part of the project. The idea is to have the client listen for voice input (already done), transform it into text (done) and send it to the server, which does all the heavy lifting / decision making.
So all my inputs will be fairly short (like "please turn on the porch light"). Based on this, I want to decide which object to act on, and how to act. So I came up with a few things to do in order to write something somewhat efficient.
Get rid of unnecessary words (in the previous example "please" and "the" are words that don't change the meaning of what needs to be done; but if I say "turn off my lights", "my" does have a fairly important meaning).
Deal with synonyms ("turn on lights" should do the same as "enable lights" -- I know it's a silly example). I'm guessing the only option is to have some kind of dictionary (XML maybe), and just have a list of possible words for one particular object in the house.
Detecting the verb and subject. "turn on" is the verb, and "lights" is the subject. I need a good way to detect this.
General implementation. How are these things usually developed in terms of algorithms? I only managed to find one article about NLP in Smart Homes, which was very vague (and had bad English). Any links welcome.
I hope the question is unique enough (I've seen NLP questions on SO, none really helped), that it won't get closed.
If you don't have a lot of time to spend on the NLP problem, you may use the Wit API (http://wit.ai), which maps natural language sentences to JSON.
It's based on machine learning, so you need to provide example sentences + JSON output to configure it to your needs. It should be much more robust than grammar-based approaches, especially because the speech-to-text engine might make mistakes that would break your grammar (but the machine learning module can still get the meaning of the sentence).
I am in no way a pioneer in NLP (I love it though), but let me try my hand at this one. For your project I would suggest you go through the Stanford Parser.
From your problem definition I guess you don't need anything other than verbs and nouns. The Stanford Parser generates POS (part of speech) tags that you can use to prune the words that you don't require.
For your second point (synonyms), I can't think of any better option than what you have in mind right now.
For the third (detecting the verb and subject), you can again use the grammatical dependency structure from the Stanford Parser, and I am pretty sure it is good enough to tackle this problem.
The last point is where your research lies. I guess you can find enough patterns using grammatical dependencies and POS tags to come up with an algorithm for your problem. I highly doubt that any algorithm will be efficient enough to handle every input sentence (structured + unstructured), but something that is more than 85% accurate should be good enough for you.
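To make the POS-pruning idea concrete, here is a quick Python sketch using NLTK as a stand-in for the Stanford Parser (assumes nltk is installed along with its tokenizer and tagger data packages, whose names vary a little between NLTK versions):

# Tag the words, then keep only verbs (VB*) and nouns (NN*).
import nltk

tokens = nltk.word_tokenize("please turn on the porch light")
tagged = nltk.pos_tag(tokens)  # e.g. [('please', 'UH'), ('turn', 'VB'), ...]

kept = [word for word, tag in tagged if tag.startswith(("VB", "NN"))]
print(kept)  # filler words like 'please' and 'the' are pruned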
First, I would construct a list of all possible commands (not every possible way to say a command, just the actual function itself: "kitchen light on" and "turn on the light in the kitchen" are the same command) based on the actual functionality the smart house has available. I assume there is a discrete number of these, on the order of no more than a few hundred. Assign each some sort of identifier code.
Your job then becomes to map an input of:
a sentence of english text
location of speaker
time of day, day of week
any other input data
to an output of a confidence level (0.0 to 1.0) for each command.
The system will then execute the best match command if the confidence is over some tunable threshold (say over 0.70).
From here it becomes a machine learning application. There are a number of different approaches (and, furthermore, approaches can be combined by having them compete based on features of the input).
To start with I would work through the NLP book from Jurafsky/Manning from Stanford. It is a good survey of current NLP algorithms.
From there you will get some ideas about how the mapping can be machine-learned and, more importantly, how natural language can be broken down into a mathematical structure suitable for machine learning.
Once the text is semantically analyzed, the simplest ML algorithms to try first are the supervised ones. To generate training data, have a normal GUI: speak your command, then press the corresponding command manually. This forms a single supervised training case. Make some large number of these and set some aside for testing. It is also unskilled work, so other people can help. You can then use these as the training set for your ML algorithm (a minimal sketch follows).
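A minimal sketch of that supervised setup (scikit-learn assumed; the sentences, command ids and threshold are made up):

# Map a sentence to per-command confidences; execute only above a threshold.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = ["please turn on the porch light", "porch light on",
             "turn off the kitchen light", "kitchen light off please"]
commands = ["PORCH_ON", "PORCH_ON", "KITCHEN_OFF", "KITCHEN_OFF"]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(sentences, commands)

probs = model.predict_proba(["switch off the light in the kitchen"])[0]
best = probs.argmax()
if probs[best] >= 0.70:  # tunable confidence threshold
    print("execute:", model.classes_[best])
else:
    print("no command above threshold, ask the user to repeat")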

Separation and pattern matching techniques

I am new to Artificial Neural Networks.
I am interested in an application like this:
I have a significantly large set of objects. Each object has six properties, denoted by P1–P6. Each property has a symbolic, non-numeric value from the set {A, B, C, D, E, F}. (Suppose A–F are colours; then you will understand my idea.)
Now, there is another property R that I am interested in. Suppose
R = {G1, G2, G3, G4, G5}
I need to train a system on a large set of P1–P6 values and the corresponding R. Then I want to do the following:
1. Given an object whose P1–P6 values I know, find its R (the group the object belongs to).
2. Given a desired R, find what pattern I need to have in P1–P6. As an example, given that R = G2, I need to figure out the pattern in P1–P6.
My questions are:
1. What are the theories/technologies/techniques I should read and learn in order to implement 1 and 2, respectively?
2. What are the tools/libraries you can recommend to get this simulated/implemented/tested?
The way you described your problem, you need to look up various machine learning techniques. If it were me, I would read about k-NN (k Nearest Neighbours) for the classification. When I say classification, I mean getting the R if you know P1–P6. It is a really simple technique and should be helpful here (a small sketch follows below).
As for the other way around, what you basically need is a representative sample of your population. This is, I think, less usual, but you could try something like k-means clustering. Clustering methods usually determine the class of an object (property R) by themselves, but k-means clustering is handy in this situation because you give it the number of object classes (e.g. the number of different possible values of R), and in the end you get one representative sample per class.
You definitely shouldn't go for any really complex techniques (like neural networks), in my opinion, since your data doesn't have a precise numerical interpretation and the values can't be interpreted gradually.
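A minimal sketch of the k-NN direction (scikit-learn assumed; the toy objects below are made up): one-hot encode the symbolic values so that distances make sense, then classify.

# Direction 1: P1..P6 -> R via 1-nearest-neighbour on one-hot vectors.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

X = [["A", "B", "C", "A", "F", "E"],
     ["A", "B", "C", "B", "F", "E"],
     ["D", "E", "F", "D", "A", "B"],
     ["D", "E", "E", "D", "A", "B"]]
y = ["G1", "G1", "G2", "G2"]

knn = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                    KNeighborsClassifier(n_neighbors=1))
knn.fit(X, y)
print(knn.predict([["A", "B", "C", "A", "F", "F"]]))  # closest pattern -> 'G1'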
The recommended tools really depend on your base programming language. There's a great tool called Orange, which is Python-based and is my tool of choice for this kind of thing (especially since it is really easy to connect your Python modules with C/C++). If you prefer Java, there's a quite similar tool called Weka. I think Weka is a little better documented, but I don't like Java, so I've never tried it out.
Both of these tools have a graphical, clickable interface where you can just load your data and get the classification done, play with the parameters and check what kind of output you get using different techniques and set-ups. Once you decide that you have the results you need (or if you just don't like graphical interfaces), you can also use both of them as libraries when programming (Python for Orange, Java for Weka) and make the classification part of a bigger project.
If you look through the documentation of Orange or Weka, it will give you a few ideas about what you can actually do with your data; and once you know a few techniques that seem interesting and applicable to it, you will probably get better comments and information on those specific methods here than by just searching for general advice.
You should check out classification algorithms (a subsection of artificial intelligence), especially the nearest-neighbour algorithms. Your problem may be solved by different techniques, which all have different advantages and disadvantages.
However, I do not know of any method in artificial intelligence which allows a two-way classification (or, in other words, one that implements your requirements 1 and 2 simultaneously). As all you want to do so far is have a bidirectional mapping P1..P6 <=> R, I would suggest just using a mapping table instead of an artificial intelligence algorithm. An AI would work great if you did not know exactly which of your samples is categorized under A..F in P1..P6.
If you insist on using an AI for it, I'd suggest first looking at a perceptron-style network. A multi-layer perceptron consists of input, intermediate and output neurons. For your example, you'd have the input neurons P1a..P1f, P2a..P2f, ..., and five output neurons R1..R5. After training, you should be able to input P1..P6 and get the appropriate R1..R5 as output.
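A small sketch of that architecture (scikit-learn's MLPClassifier assumed as a stand-in; the toy data is made up). One-hot encoding produces exactly the P1a..P1f-style input neurons described above:

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

X = [["A", "B", "C", "A", "F", "E"],
     ["B", "B", "C", "A", "F", "E"],
     ["D", "E", "F", "D", "A", "B"],
     ["D", "F", "F", "D", "A", "B"]]
y = ["G1", "G1", "G2", "G2"]

net = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                    MLPClassifier(hidden_layer_sizes=(8,),  # "intermediate" layer
                                  max_iter=2000, random_state=0))
net.fit(X, y)
print(net.predict([["A", "B", "C", "A", "F", "E"]]))  # -> ['G1']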
As for frameworks and technologies, I only know of the Business Intelligence suite for Visual Studio, although there are a lot of other AI frameworks out there. Since I have not used any of them (I have always coded them myself in C/C++), I can't recommend any.
It seems like a typical classification problem. In case you really have a lot of data, have a look at Apache Mahout, which provides distributed implementations of machine learning algorithms. If you need something less complex for prototyping, TiMBL is a nice alternative.

Predicting missing data values in a database

I have a database consisting of a whole bunch of records (around 600,000) where some of the records have certain fields missing. My goal is to find a way to predict what the missing data values should be (so I can fill them in) based on the existing data.
One option I am looking at is clustering, i.e. representing the complete records as points in some space, looking for clusters of points, and then, given a record with missing values, checking whether there are any clusters it could belong to that are consistent with its existing values. However, this may not be possible, because some of the data fields are on a nominal scale (e.g. colour) and thus can't be ordered.
Another idea I had is to create some sort of probabilistic model that would predict the data, train it on the existing data, and then use it to extrapolate.
What algorithms are available for doing the above, and is there any freely available software that implements them? (This software is going to be in C#, by the way.)
This is less of an algorithmic and more of a philosophical and methodological question. There are a few different techniques available to tackle this kind of problem. Acock (2005) gives a good introduction to some of the methods. Although it may seem that there is a lot of maths/statistics involved (and a lot of effort), it's worth thinking about what would happen if you got it wrong.
Andrew Gelman's blog is also a good resource, although the search functionality on his blog leaves something to be desired...
Hope this helps.
Acock (2005): http://oregonstate.edu/~acock/growth-curves/working%20with%20missing%20values.pdf
Andrew Gelman's blog: http://www.stat.columbia.edu/~cook/movabletype/mlm/
Dealing with missing values is a methodological question that has to do with the actual meaning of the data.
Several methods you can use (there is a detailed post on my blog):
1. Ignore the data row. This is usually done when the class label is missing (assuming your data mining goal is classification), or when many attributes are missing from the row (not just one). However, you'll obviously get poor performance if the percentage of such rows is high.
2. Use a global constant to fill in the missing values, like "unknown", "N/A" or minus infinity. This is used when it just doesn't make sense to try to predict the missing value. For example, if you have a DB of, say, college candidates, and the state of residence is missing for some, filling it in doesn't make much sense.
3. Use the attribute mean. For example, if the average income of a US family is X, you can use that value to replace missing income values.
4. Use the attribute mean for all samples belonging to the same class. Let's say you have a car pricing DB that, among other things, classifies cars as "Luxury" or "Low budget", and you're dealing with missing values in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value you'd get if you also factored in the low-budget cars. (A small sketch of strategies 3 and 4 follows this list.)
5. Use a data mining algorithm to predict the value. The value can be determined using regression, inference-based tools using a Bayesian formalism, decision trees, or clustering algorithms used to generate input for method 4 (k-means/median, etc.).
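A small sketch of strategies 3 and 4 (pandas assumed; the cars below are made up):

# Fill a missing cost with the global mean, or with the mean of the
# samples in the same class ("Luxury" vs "Low budget").
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    "klass": ["Luxury", "Luxury", "Low budget", "Low budget"],
    "cost":  [80000, np.nan, 12000, np.nan],
})

global_fill = cars["cost"].fillna(cars["cost"].mean())         # strategy 3
class_fill = cars.groupby("klass")["cost"] \
                 .transform(lambda c: c.fillna(c.mean()))      # strategy 4
print(class_fill.tolist())  # [80000.0, 80000.0, 12000.0, 12000.0]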
I'd suggest looking into regression and decision trees first (ID3 tree generation), as they're relatively easy and there are plenty of examples on the net.
As for packages: if you can afford it and you're in the Microsoft world, look at SQL Server Analysis Services (SSAS for short), which implements most of the techniques mentioned above.
Here are some links to free data mining software packages:
WEKA - http://www.cs.waikato.ac.nz/ml/weka/index.html
ORANGE - http://www.ailab.si/orange
TANAGRA - http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html
Although not in C#, here is a pretty good intro to decision trees and Bayesian learning (using Ruby):
http://www.igvita.com/2007/04/16/decision-tree-learning-in-ruby/
http://www.igvita.com/2007/05/23/bayes-classification-in-ruby/
There's also this Ruby library that I find very useful (also for learning purposes):
http://ai4r.rubyforge.org/machineLearning.html
There should be plenty of samples of these algorithms online in any language, so I'm sure you'll easily find C# material too...
Edited:
I forgot this in my original post. This is definitely a MUST HAVE if you're playing with data mining...
Download the Microsoft SQL Server 2008 Data Mining Add-ins for Microsoft Office 2007 (it requires SQL Server Analysis Services, SSAS, which isn't free, but you can download a trial).
This will allow you to easily play with and try out the different techniques in Excel before you go and implement this stuff yourself. Then again, since you're in the Microsoft ecosystem, you might even decide to go for an SSAS-based solution and count on the SQL Server guys to do it for you :)
Predicting missing values is generally considered part of the data cleansing phase, which needs to be done before the data is mined or analyzed further. This is quite prominent in real-world data.
Please have a look at this algorithm http://arxiv.org/abs/math/0701152
Currently, Microsoft SQL Server Analysis Services 2008 also comes with algorithms like these, which help in the predictive modelling of attributes: http://technet.microsoft.com/en-us/library/ms175312.aspx
cheers
