How to extract features from plain text? - algorithm

I am writing a text parser which should extract features from product descriptions.
Eg:
text = "Canon EOS 7D Mark II Digital SLR Camera with 18-135mm IS STM Lens"
features = extract(text)
print features
Brand: Canon
Model: EOS 7D
....
The way I do this is by training the system with structured data and coming up with an inverted index which can map a term to a feature. This works mostly well.
When the text contains measurements like 50ml, or 2kg, the inverted index will say 2kg -> Size and 50ml -> Size for eg.
The problem here is that, when I get a value which I haven't seen before, like 13ml, it won't be processed. But since the patterns matches to a size, we could tag it as size.
I was thinking to solve this problem by preprocessing the tokens that I get from the text and look for patterns that I know. So when new patterns are identified, that has to be added to the preprocessing.
I was wondering, is this the best way to go about this? Or is there a better way of doing this?

The age-old problem of unseen cases. You could train your scraper to grab any number-like characters preceding certain suffixes (ml, kg, etc) and treat those as size. The problem with this is typos and other poorly formatted texts could enter into your structure data. There is no right answer for how to handle values you haven't seen before - you'll either have to QC them individually, or have rules around them. This is dependent on your dataset.
As far as identifying patterns, you'll either have to manually enter them, or manually classify a lot of records and let the algorithm learn them. Not sure that's very helpful, but a lot of this is very dependent on your data.

If you have a training data like this:
word label
10ml size-valume
20kg size-weight
etc...
you could train a classifier based on character n-grams and that would detect that ml is size-volume even if it sees a 11-ml or ml11 etc. you should also convert the numbers into a single number (e.g. 0) so that 11-ml is seen as 0-ml before feature extraction.
For that you'll need a preprocessing module and also a large training sample. For feature extraction you can use scikit-learn's character n-grams and also SVM.

Related

Correct way to represent documents containing multiple sentences in gensim file-based training

I am trying to use gensim's file-based training (example from documentation below):
from multiprocessing import cpu_count
from gensim.utils import save_as_line_sentence
from gensim.test.utils import get_tmpfile
from gensim.models import Word2Vec, Doc2Vec, FastText
# Convert any corpus to the needed format: 1 document per line, words delimited by " "
corpus = api.load("text8")
corpus_fname = get_tmpfile("text8-file-sentence.txt")
save_as_line_sentence(corpus, corpus_fname)
# Choose num of cores that you want to use (let's use all, models scale linearly now!)
num_cores = cpu_count()
# Train models using all cores
w2v_model = Word2Vec(corpus_file=corpus_fname, workers=num_cores)
d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores)
ft_model = FastText(corpus_file=corpus_fname, workers=num_cores)
However, my actual corpus contains many documents, each containing many sentences.
For example, let's assume my corpus is the plays of Shakespeare - Each play is a document, each document has many many sentences, and I would like to learn embeddings for each play, but the word embeddings only from within the same sentence.
Since the file-based training is meant to be one document per line, I assume that I should put one play per line. However, the documentation for file-based-training doesn't have an example of any documents with multiple sentences.
Is there a way to peek inside the model to see the documents and word context pairs that have been found before they are trained?
What is the correct way to build this file, maintaining sentence boundaries?
Thank you
These algorithm implementations don't have any real understanding of, or dependence on, actual sentences. They just take texts – runs of word-tokens.
Often the texts provided to Word2Vec will be multiple sentences. Sometimes punctuation like sentence-ending periods are even retained as pseudo-words. (And when the sentences were really consecutive with each other in the source data, the overlapping word-context windows, between sentences, may even be a benefit.)
So you don't have to worry about "maintaining sentence boundaries". Any texts you provide that are sensible units of words that really co-occur will work about as well. (Especially in Word2Vec and FastText, even changing your breaks between texts to be sentences, or paragraphs, or sections, or documents is unlikely to have very much effect on the final word-vectors – it's just changing a subset of the training contexts, and probably not in any way that significantly changes which words influence which other words.)
There is, however, another implementation limit in gensim that you should watch out for: each training text can only be 10,000 tokens long, and if you supply larger texts, the extra tokens will be silently ignored.
So, be sure to use texts that are 10k tokens or shorter – even if you have to arbitrarily split longer ones. (Per above, any such arbitrary extra break in the token grouping is unlikely to have a noticeable effect on results.)
However, this presents a special problem using Doc2Vec in corpus_file mode, because in that mode, you don't get to specify your preferred tags for a text. (A text's tag, in this mode, is essentially just the line-number.)
In the original sequence corpus mode, the workaround for this 10k token limit was just to break up larger docs into multiple docs - but use the same repeated tags for all sub-documents from an original document. (This very closely approximates how a doc of any size would affect training.)
If you have documents with more than 10k tokens, I'd recommend either not using corpus_file mode, or figuring some way to use logical sub-documents of less than 10k tokens, then perhaps modeling your larger docs as the set of their sub-documents, or otherwise adjusting your downstream tasks to work on the same sub-document units.

Predicting phrases instead of just next word

For an application that we built, we are using a simple statistical model for word prediction (like Google Autocomplete) to guide search.
It uses a sequence of ngrams gathered from a large corpus of relevant text documents. By considering the previous N-1 words, it suggests the 5 most likely "next words" in descending order of probability, using Katz back-off.
We would like to extend this to predict phrases (multiple words) instead of a single word. However, when we are predicting a phrase, we would prefer not to display its prefixes.
For example, consider the input the cat.
In this case we would like to make predictions like the cat in the hat, but not the cat in & not the cat in the.
Assumptions:
We do not have access to past search statistics
We do not have tagged text data (for instance, we do not know the parts of speech)
What is a typical way to make these kinds of multi-word predictions? We've tried multiplicative and additive weighting of longer phrases, but our weights are arbitrary and overfit to our tests.
For this question, you need to define what it is you consider to be a valid completion -- then it should be possible to come up with a solution.
In the example you've given, "the cat in the hat" is much better than "the cat in the". I could interpret this as, "it should end with a noun" or "it shouldn't end with overly common words".
You've restricted the use of "tagged text data" but you could use a pretrained model, (e.g. NLTK, spacy, StanfordNLP) to guess the parts of speech and make an attempt to restrict predictions to only complete noun-phrases (or sequence ending in noun). Note that you would not necessarily need to tag all documents fed into the model, but only those phrases you're keeping in your autocomplete db.
Alternately, you could avoid completions that end in stopwords (or very high frequency words). Both "in" and "the" are words that occur in almost all English documents, so you could experimentally find a frequency cutoff (can't end in a word that occurs in more than 50% of documents) that help you filter. You could also look at phrases -- if the end of the phrase is drastically more common as a shorter phrase, then it doesn't make sense to tag it on, as the user could come up with it on their own.
Ultimately, you could create a labeled set of good and bad instances and try to create a supervised re-ranker based on word features -- both ideas above could be strong features in a supervised model (document frequency = 2, pos tag = 1). This is typically how search engines with data can do it. Note that you don't need search statistics or users for this, just a willingness to label the top-5 completions for a few hundred queries. Building a formal evaluation (that can be run in an automated manner) would probably help when trying to improve the system in the future. Any time you observe a bad completion, you could add it to the database and do a few labels -- over time, a supervised approach would get better.

Predicting Missing Word in Sentence

How can I predict a word that's missing from a sentence?
I've seen many papers on predicting the next word in a sentence using an n-grams language model with frequency distributions from a set of training data. But instead I want to predict a missing word that's not necessarily at the end of the sentence. For example:
I took my ___ for a walk.
I can't seem to find any algorithms that take advantage of the words after the blank; I guess I could ignore them, but they must add some value. And of course, a bi/trigram model doesn't work for predicting the first two words.
What algorithm/pattern should I use? Or is there no advantage to using the words after the blank?
Tensorflow has a tutorial to do this: https://www.tensorflow.org/versions/r0.9/tutorials/word2vec/index.html
Incidentally it does a bit more and generates word embeddings, but to get there they train a model to predict the (next/missing) word. They also show using only the previous words, but you can apply the same ideas and add the words that follow.
They also have a bunch of suggestions on how to improve the precision (skip ngrams).
Somewhere at the bottom of the tutorial you have links to working source-code.
The only thing to be worried about is to have sufficient training data.
So, when I've worked with bigrams/trigrams, an example query generally looked something like "Predict the missing word in 'Would you ____'". I'd then go through my training data and gather all the sets of three words matching that pattern, and count the things in the blanks. So, if my training data looked like:
would you not do that
would you kindly pull that lever
would you kindly push that button
could you kindly pull that lever
I would get two counts for "kindly" and one for "not", and I'd predict "kindly". All you have to do for your problem is consider the blank in a different place: "____ you kindly" would get two counts for "would" and one for "could", so you'd predict "would". As far as the computer is concerned, there's nothing special about the word order - you can describe whatever pattern you want, from your training data. Does that make sense?

How can I get popular tags/keywords from a collection of unstructured text chunks?

I am storing small chunks of texts - say of around 100 - 200 words - in a NoSQL database, and need to display the trending keywords/tags among all of these chunks.
I know of text analysis APIs like alchemy which extract entities from a single chunk of text, but I want top keywords/tags among all the chunks.
Should I store keywords against each text-chunk and then do an exhaustive counting of the top keywords? In which case, each keyword may differ slightly and may lead to fragmentation of similar keywords.
Its not always necessary that filtering out entities would provide you the result (thought it serves a basic purpose). If you want it to be more effective you should remove the stopwords, do stemming, UpperCase to LowerCase converstion, spelling correction and then use a HashMap to find frequencies.
Using this frequency you can filter out top 100-200 entities/tags.
I hope this helps.

How to spot and analyse similar patterns like Excel does?

You know the functionality in Excel when you type 3 rows with a certain pattern and drag the column all the way down Excel tries to continue the pattern for you.
For example
Type...
test-1
test-2
test-3
Excel will continue it with:
test-4
test-5
test-n...
Same works for some other patterns such as dates and so on.
I'm trying to accomplish a similar thing but I also want to handle more exceptional cases such as:
test-blue-somethingelse
test-yellow-somethingelse
test-red-somethingelse
Now based on this entries I want say that the pattern is:
test-[DYNAMIC]-something
Continue the [DYNAMIC] with other colours is whole another deal, I don't really care about that right now. I'm mostly interested in detecting the [DYNAMIC] parts in the pattern.
I need to detect this from a large of pool entries. Assume that you got 10.000 strings with this kind of patterns, and you want to group these strings based on similarity and also detect which part of the text is constantly changing ([DYNAMIC]).
Document classification can be useful in this scenario but I'm not sure where to start.
UPDATE:
I forgot to mention that also it's possible to have multiple [DYNAMIC] patterns.
Such as:
test_[DYNAMIC]12[DYNAMIC2]
I don't think it's important but I'm planning to implement this in .NET but any hint about the algorithms to use would be quite helpful.
As soon as you start considering finding dynamic parts of patterns of the form : <const1><dynamic1><const2><dynamic2>.... without any other assumptions then you would need to find the longest common subsequence of the sample strings you have provided. For example if I have test-123-abc and test-48953-defg then the LCS would be test- and -. The dynamic parts would then be the gaps between the result of the LCS. You could then look up your dynamic part in an appropriate data structure.
The problem of finding the LCS of more than 2 strings is very expensive, and this would be the bottleneck of your problem. At the cost of accuracy you can make this problem tractable. For example, you could perform LCS between all pairs of strings, and group together sets of strings having similar LCS results. However, this means that some patterns would not be correctly identified.
Of course, all this can be avoided if you can impose further restrictions on your strings, like Excel does which only seems to allow patterns of the form <const><dynamic>.
finding [dynamic] isnt that big of deal, you can do that with 2 strings - just start at the beginning and stop when they start not-being-equals, do the same from the end, and voila - you got your [dynamic]
something like (pseudocode - kinda):
String s1 = 'asdf-1-jkl';
String s2= 'asdf-2-jkl';
int s1I = 0, s2I = 0;
String dyn1, dyn2;
for (;s1I<s1.length()&&s2I<s2.length();s1I++,s2I++)
if (s1.charAt(s1I) != s2.charAt(s2I))
break;
int s1E = s1.length(), s2E = s2.length;
for (;s2E>0&&s1E>0;s1E--,s2E--)
if (s1.charAt(s1E) != s2.charAt(s2E))
break;
dyn1 = s1.substring(s1I, s1E);
dyn2 = s2.substring(s2I, s2E);
About your 10k data-sets. You would need to call this (or maybe a little more optimized version) with each combination to figure out your patten (10k x 10k calls). and then sort the result by pattern (ie. save the begin and the ending and sort by these fields)
I think what you need is to compute something like the Levenshtein distance, to find the group of similar strings, and then in each group of similar strings, you indentify the dynamic part in a typical diff-like algorithm.
Google docs might be better than excel for this sort of thing, believe it or not.
Google has collected massive amounts of data on sets - for example the in the example you gave it would recognise the blue, red, yellow ... as part of the set 'colours'. It has far more complete pattern recognition than Excel so would stand a better chance of continuing the pattern.

Resources