I am looking to develop a machine learning algorithm to parse a sentence and identify various parts in it. Here is what I mean:
Consider the sentence 'Demonstrate me the procedure to turn on a fan when there is no electricity'. I would like to split the sentence into:
Command: 'Demonstrate'
Action: 'Turn on a fan'
Condition: 'When there is no electricity'
The way I plan to do this is by using a large number of sample inputs of sentences and specifying target outputs in each case. Then, I would use an appropriate machine learning algorithm for classification.
The problem I am facing is with data preparation for machine learning training. So far, I have thought about the following approach:
1- Parse the sentence and determine the POS of each word. Assign each word a digit from 1 to 7 depending on its part of speech. Concatenate the digits so the sentence gets a specific code, e.g., 102163374. Use that as an independent feature.
2- Use the total number of words as the second independent feature.
The precise problem with this approach is that the first feature varies wildly with the number of words in the sentence. Is that a problem? If so, how do I deal with it?
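To make this concrete, here is roughly what I mean, sketched with NLTK (the digit mapping below is arbitrary and just for illustration, and the usual NLTK tokenizer/tagger data must be downloaded first):

import nltk

# made-up mapping from coarse POS tags to the digits 1-7; 0 = anything else
POS_TO_DIGIT = {"NOUN": 1, "VERB": 2, "ADJ": 3, "ADV": 4, "ADP": 5, "DET": 6, "PRON": 7}

def encode_sentence(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens, tagset="universal")    # coarse universal tags
    code = "".join(str(POS_TO_DIGIT.get(tag, 0)) for _, tag in tagged)
    return code, len(tokens)    # (first feature: POS code, second feature: word count)

print(encode_sentence("Demonstrate me the procedure to turn on a fan"))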
Related
I'm new to machine learning and I want to identify the given words using an algorithm.
As an example,
construct the triangle ABC such that AB=7cm, BAC=60 and AC=5.5cm
construct the square that is 7cm long each side.
in this example I need to identify the words triangle and square.
So it seems like you want the algorithm to be intelligent rather than just identify a couple of words. For that you should go for natural language processing (NLP). There you can identify the nouns for the different geometrical objects, and if you also want to gather the details, I mean if you want to list that AB=7, BAC=60 and AC=5.5, then learn about recurrent neural networks (RNNs). RNN models can remember what was said at the beginning of the sentence and work out which details belong to it.
ex - Anna lives in Paris and she's fluent in French.
An RNN can identify the word French.
So just by using a basic machine learning algorithm it is not possible to identify those words and gather the details from such a sentence.
You can read this article for further understanding.
Understanding LSTM Networks
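As a baseline before any of the above, the shape words themselves can be picked out with a trivial lookup over the tokens; here is a rough sketch with NLTK (the SHAPES set is just an example and would need extending). Gathering the measurements that belong to each shape is the part that needs the sequence models described above.

import nltk

SHAPES = {"triangle", "square", "circle", "rectangle"}    # example list, extend as needed

def find_shapes(sentence):
    tokens = nltk.word_tokenize(sentence.lower())
    return [t for t in tokens if t in SHAPES]

print(find_shapes("construct the triangle ABC such that AB=7cm, BAC=60 and AC=5.5cm"))
print(find_shapes("construct the square that is 7cm long each side."))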
All,
I have been running Y!LDA (https://github.com/shravanmn/Yahoo_LDA) on a set of documents and the results look great (or at least what I would expect). Now I want to use the resulting topics to perform a reverse query against the corpus. Does anyone know whether the three human-readable text files generated after the learntopics executable is run are the final output of this library? If so, is that what I need to parse to perform my queries? I am stuck shrugging my shoulders at this point...
Thanks,
Adam
If LDA is working the way I think it is (I use a java implementation, so explanations may vary) then what you get out are the three following things:
P(word,concept) -- The probability of getting a word given a concept. So, when LDA finishes figuring out what concepts exist within the corpus, this P(w,c) will tell you (in theory) which words map to which concepts.
A very naive method of determining concepts would be to load this file into a matrix, combine these probabilities across all possible concepts for a test document in some way (add, multiply, root-mean-square), and rank-order the concepts (a rough sketch follows below, after the three outputs).
Do note that the above method does not recognize the various biases introduced by weakly represented topics or dominating topics in LDA. To accommodate that, you need more complicated algorithms (Gibbs sampling, for instance), but this will get you some results.
P(concept,document) -- If you are attempting to find the intrinsic concepts in the documents in the corpus, you would look here. You can use the documents as examples of documents that have a particular concept distribution, and compare your documents to the LDA corpus documents... There are uses for this, but it may not be as useful as the P(w,c).
Something else probably relating to the weights of words, documents, or concepts. This could be as simple as a set of concept examples with beta weights (for the concepts), or some other variables that are output from LDA. These may or may not be important depending on what you are doing. (If you are attempting to add a document to the LDA space, having the alpha or beta values -- very important.)
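To make the naive ranking above concrete, here is a rough sketch. It assumes P(w,c) has already been parsed into a dict of {word: {concept_id: probability}}; parsing Y!LDA's output files is not shown.

import math
from collections import defaultdict

def rank_concepts(test_doc_words, p_wc, smoothing=1e-12):
    # p_wc: {word: {concept_id: P(word|concept)}} parsed from the LDA output
    concepts = {c for probs in p_wc.values() for c in probs}
    scores = defaultdict(float)
    for word in test_doc_words:
        word_probs = p_wc.get(word, {})
        for c in concepts:
            # summing log-probabilities is the 'multiply' variant mentioned above
            scores[c] += math.log(word_probs.get(c, 0.0) + smoothing)
    # highest (least negative) total first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)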
To answer your 'reverse lookup' question, to determine the concepts of the test document, use P(w,c) for each word w in the test document.
To determine which document is the most like the test document, determine the above concepts, then compare them to the concepts for each document found in P(c,d) (using each concept as a dimension in vector-space and then determining a cosine between the two documents tends to work alright).
To determine the similarity between two documents, same thing as above, just determine the cosine between the two concept-vectors.
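For the cosine part, something like the following works, with each concept vector represented as a dict of concept id to weight (e.g. a row taken from P(c,d) or the ranked concepts computed above):

import math

def cosine(vec_a, vec_b):
    dot = sum(vec_a.get(c, 0.0) * vec_b.get(c, 0.0) for c in set(vec_a) | set(vec_b))
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0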
Hope that helps.
I'm fairly new to machine learning and text mining in general. I recently came across a Ruby library called Liblinear: https://github.com/tomz/liblinear-ruby-swig.
What I want to do is train a classifier to identify whether a text mentions anything related to bicycles or not.
Can someone please highlight the steps I should be following (i.e. how to preprocess the text), share resources, and ideally share a simple example to get me going.
Any help will do, thanks!
The classical approach is:
Collect a representative sample of input texts, each labeled as related/unrelated.
Divide the sample into training and test sets.
Extract all the terms in all the documents of the training set; call this the vocabulary, V.
For each document in the training set, convert it into a vector of booleans where the i'th element is true/1 iff the i'th term in the vocabulary occurs in the document.
Feed the vectorized training set to the learning algorithm.
Now, to classify a document, vectorize it as in step 4 and feed it to the classifier to get a related/unrelated label for it. Compare this with the actual label to see if it went right. You should be able to get around 80% accuracy with this simple method.
To improve this method, replace the booleans with term counts, normalized by document length, or, even better, tf-idf scores.
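To make the steps concrete, here is the same pipeline sketched in Python with scikit-learn (whose LinearSVC is backed by LIBLINEAR); the texts and labels below are toy placeholders, and the Ruby binding follows the same shape:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# toy labeled sample: 1 = mentions bicycles, 0 = unrelated
texts = [
    "I rode my bicycle to work today",
    "New mountain bike trails opened in the park",
    "The stock market fell sharply today",
    "This recipe needs two cups of flour",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()            # steps 3-4 (with the tf-idf improvement)
X = vectorizer.fit_transform(texts)
clf = LinearSVC().fit(X, labels)          # step 5: train the linear classifier

new_doc = vectorizer.transform(["looking for a good road bicycle"])
print(clf.predict(new_doc))               # classify an unseen text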
The problem
I am trying to improve the result of an OCR process by combining the output from three different OCR systems (Tesseract, CuneiForm, Ocrad).
I already do image preprocessing (deskewing, despeckling, thresholding and some more). I don't think that this part can be improved much more.
Usually the text to recognize is between one and six words long. The language of the text is unknown, and it quite often contains fantasy words.
I am on Linux. Preferred language would be Python.
What I have so far
Often each result has one or two errors, but the errors are at different characters/positions. Sometimes a character is recognized incorrectly, sometimes a character is included that does not exist in the text. Less often, a character is missed entirely.
An example might look in the following way:
Xorem_ipsum
lorXYm_ipsum
lorem_ipuX
An X is an incorrectly recognized character and a Y is a character that does not exist in the text. Spaces are replaced by "_" for better readability.
In cases like this I try to combine the different results.
By repeatedly applying the "longest common substring" algorithm to the three pairs, I am able to get the following structure for the given example:
or m_ipsum
lor m_ip u
orem_ip u
But here I am stuck. I am not able to combine those pieces into a result.
The questions
Do you have an idea how to combine the different longest common substrings?
Or do you have a better idea how to solve this problem?
It all depends on the OCR engines you are using as to the quality of results you can expect. You may find that choosing a higher-quality OCR engine that gives you confidence levels and bounding boxes produces much better raw results in the first place, plus extra information that can be used to determine the correct result.
Using Linux will restrict the possible OCR engines available to you. Personally I would rate Tesseract as 6.5/10 compared to commercial OCR engines available under Windows.
http://www.abbyy.com/ocr_sdk_linux/overview/ - The SDK may not be cheap though.
http://irislinktest.iriscorporate.com/c2-1637-189/iDRS-14-------Recognition--Image-preprocessing--Document-formatting-and-more.aspx - Available for Linux
http://www.rerecognition.com/ - Is available as a Linux version. This engine is used by many other companies.
All of the engines above should give you confidence levels, bounding boxes and better results than Tesseract OCR.
https://launchpad.net/cuneiform-linux - Cuneiform, now open sourced and running under Linux. This is likely one of the three engines you are using. If not, you should probably look at adding it.
Also you may want to look at http://tev.fbk.eu/OCR/Products.html for more options.
Can you post a sample or two of typical images and the OCR results from the engines? There are other ways to improve OCR recognition, but it would depend on the images.
Maybe repeatedly apply the "longest common substring" algorithm until all results are the same.
For your example, you would get the following in the next step:
or m_ip u
or m_ip u
or m_ip u
Or run the "longest common substring" algorithm on the first and second strings, and then again on that result and the third string. That way you get the same result, or m_ip u, more easily.
So you can assume that those letters are correct. Now look at the gaps. Before "or" there is l twice and X once, so choose l. Between "or" and "m_ip" there is e twice and XY once, so choose e. And so on.
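To illustrate the pairwise step, here is a small helper using Python's difflib; the gap voting described above still needs the alignment bookkeeping and is not shown:

from difflib import SequenceMatcher
from functools import reduce

def longest_common_substring(a, b):
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]

results = ["Xorem_ipsum", "lorXYm_ipsum", "lorem_ipuX"]
# a single pass only recovers one shared piece; repeating on the
# remainders gives the full structure shown in the question
print(reduce(longest_common_substring, results))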
I'm new to OCR, but so far I have found that those systems are built to work from a dictionary of words rather than letter by letter. So, if your images don't contain real words, you may have to look more closely at the letter recognition and training parts of the systems you are using.
I faced a very similar problem.
I hope that this can help: http://dl.tufts.edu/catalog/tufts:PB.001.011.00001
See also software developed by Bruce Robertson: https://github.com/brobertson/rigaudon
I'm looking for an algorithm or example material to study for predicting future events based on known patterns. Perhaps there is a name for this, and I just don't know/remember it. Something this general may not exist, but I'm not a master of math or algorithms, so I'm here asking for direction.
An example, as I understand it, would be something like this:
A static event occurs on January 1st, February 1st, March 3rd, April 4th. A simple solution would be to average the days/hours/minutes/something between each occurrence, add that number to the last known occurrence, and have the prediction.
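To make that concrete, the naive version would be something like this (the dates are just the example above, placed in an arbitrary non-leap year):

from datetime import date, timedelta

occurrences = [date(2013, 1, 1), date(2013, 2, 1), date(2013, 3, 3), date(2013, 4, 4)]

gaps = [(b - a).days for a, b in zip(occurrences, occurrences[1:])]
mean_gap = sum(gaps) / len(gaps)                     # average days between occurrences
print(occurrences[-1] + timedelta(days=mean_gap))    # predicted next occurrence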
What am I asking for, or what should I study?
There is no particular goal in mind, or any specific variables to account for. This is simply a personal thought, and an opportunity for me to learn something new.
I think some topics that might be worth looking into include numerical analysis, specifically interpolation, extrapolation, and regression.
This could be overkill, but Markov chains can lead to some pretty cool pattern recognition stuff. It's better suited to, well, chains of events: the idea is, based on the last N steps in a chain of events, what will happen next?
This is well suited to text: process a large sample of Shakespeare, and you can generate paragraphs full of Shakespeare-like nonsense! Unfortunately, it takes a good deal more data to figure out sparsely-populated events. (Detecting patterns with a period of a month or more would require you to track a chain of at least a full month of data.)
Here's a rough sketch, in Python, of a Markov chain builder/prediction script:
from collections import defaultdict
from random import choice

n = 3    # how big a chain you want: number of preceding events to condition on

def build_map(event_chain):
    transition_map = defaultdict(list)
    # every window of n consecutive events, plus the event that followed it
    for i in range(len(event_chain) - n):
        window = tuple(event_chain[i:i + n])
        transition_map[window].append(event_chain[i + n])
    return transition_map

def predict_next_event(whats_happened_so_far, transition_map):
    window = tuple(whats_happened_so_far[-n:])
    return choice(transition_map[window])
There is no single 'best' canned solution; it depends on what you need. For instance, you might want to average the values as you say, but use weighted averages where the old values do not contribute as much to the result as the new ones. Or you might try some smoothing. Or you might try to see if the distribution of events fits a well-known distribution (like normal, Poisson, uniform).
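As one concrete example of the weighted-average idea, an exponentially weighted average of the inter-event gaps, where recent gaps count more than old ones (the alpha value is arbitrary):

def exponential_smoothing(values, alpha=0.5):
    estimate = values[0]
    for v in values[1:]:
        estimate = alpha * v + (1 - alpha) * estimate   # newer values weigh more
    return estimate

gaps = [31, 30, 32]                   # days between the example events (non-leap year)
print(exponential_smoothing(gaps))    # smoothed estimate of the next gap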
If you have a model in mind (such as the events occur regularly), then applying a Kalman filter to the parameters of that model is a common technique.
The only technique I've worked with for trying to do something like that would be training a neural network to predict the next step in the series. That implies interpreting the issue as a problem in pattern classification, which doesn't seem like that great a fit; I have to suspect there are less fuzzy ways of dealing with it.
The task is very similar to a language modelling task, where given a sequence of history words the model tries to predict a probability distribution over the vocabulary for the next word.
There are open-source tools such as SRILM and NLTK that can simply take your sequences as input sentences (each event_id is a word) and do the job.
If you merely want to find the probability of an event occurring after n days given prior data of its frequency, you'll want to fit an appropriate probability distribution, which generally requires knowing something about the source of the event (maybe it should be Poisson distributed, maybe Gaussian). If you want to find the probability of an event happening given that prior events happened, you'll want to look at Bayesian statistics and how to build a Markov chain from that.
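For instance, here is a rough sketch with SciPy that fits an exponential distribution to the inter-event gaps and asks how likely the next event is within 30 days; whether the exponential is appropriate depends on the source of your events, as noted above:

from scipy import stats

gaps = [31, 30, 32]                          # observed days between events
loc, scale = stats.expon.fit(gaps, floc=0)   # fit an exponential with location fixed at 0
print(stats.expon.cdf(30, loc=loc, scale=scale))   # P(next event within 30 days)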
You should google Genetic Programming Algorithms
They (sort of like the neural networks mentioned by Chaos) will enable you to generate solutions programmatically, then have the program modify itself based on a criterion, and create new solutions which are hopefully more accurate.
Neural Networks would have to be trained by you, but with genetic programming, the program will do all the work.
Although it is a hell of a lot of work to get them running in the first place!