Document Features Vector Representation - algorithm

I am building a document classifier to categorize documents.
So first step is to represent each documents as "features vector" for the training purpose.
After some research, I found that I can use either the Bag of Words approach or N-gram approach to represent a document as a vector.
The text in each document (scanned pdfs and images) is retrieved using an OCR, thus some words contain errors. And I don't have previous knowledge about the language used in these documents (can't use stemming).
So as far as I understand I have to use the n-gram approach. or are there other approaches to represent a document ?
I would also appreciate if someone could link me to an N-Gram guide in order to have a clearer picture and understand how it works.
Thanks in Advance

Use language detection to get document's language (my favorite tool is LanguageIdentifier from Tika project, but many others are available).
Use spell correction (see this question for some details).
Stem words (if you work in Java environment, Lucene is your choice).
Collect all N-grams (see below).
Make instances for classification by extracting n-grams from particular documents.
Build classifier.
N-gram models
N-grams are just sequences of N items. In classification by topic you normally use N-grams of words or their roots (though there are models based on N-grams of chars). Most popular N-grams are unigrams (just word), bigrams (2 serial words) and trigrams (3 serial words). So, from sentence
Hello, my name is Frank
you should get following unigrams:
[hello, my, name, is, frank] (or [hello, I, name, be, frank], if you use roots)
following bigrams:
[hello_my, my_name, name_is, is_frank]
and so on.
At the end your feature vector should have as much positions (dimensions) as there are words in all your text plus 1 for unknown words. Every position in instance vector should somehow reflect number of corresponding words in instance text. This may be number of occurrences, binary feature (1 if word occurs, 0 otherwise), normalized feature or tf-idf (very popular in classification by topic).
Classification process itself is the same as for any other domain.

Related

Predicting phrases instead of just next word

For an application that we built, we are using a simple statistical model for word prediction (like Google Autocomplete) to guide search.
It uses a sequence of ngrams gathered from a large corpus of relevant text documents. By considering the previous N-1 words, it suggests the 5 most likely "next words" in descending order of probability, using Katz back-off.
We would like to extend this to predict phrases (multiple words) instead of a single word. However, when we are predicting a phrase, we would prefer not to display its prefixes.
For example, consider the input the cat.
In this case we would like to make predictions like the cat in the hat, but not the cat in & not the cat in the.
Assumptions:
We do not have access to past search statistics
We do not have tagged text data (for instance, we do not know the parts of speech)
What is a typical way to make these kinds of multi-word predictions? We've tried multiplicative and additive weighting of longer phrases, but our weights are arbitrary and overfit to our tests.
For this question, you need to define what it is you consider to be a valid completion -- then it should be possible to come up with a solution.
In the example you've given, "the cat in the hat" is much better than "the cat in the". I could interpret this as, "it should end with a noun" or "it shouldn't end with overly common words".
You've restricted the use of "tagged text data" but you could use a pretrained model, (e.g. NLTK, spacy, StanfordNLP) to guess the parts of speech and make an attempt to restrict predictions to only complete noun-phrases (or sequence ending in noun). Note that you would not necessarily need to tag all documents fed into the model, but only those phrases you're keeping in your autocomplete db.
Alternately, you could avoid completions that end in stopwords (or very high frequency words). Both "in" and "the" are words that occur in almost all English documents, so you could experimentally find a frequency cutoff (can't end in a word that occurs in more than 50% of documents) that help you filter. You could also look at phrases -- if the end of the phrase is drastically more common as a shorter phrase, then it doesn't make sense to tag it on, as the user could come up with it on their own.
Ultimately, you could create a labeled set of good and bad instances and try to create a supervised re-ranker based on word features -- both ideas above could be strong features in a supervised model (document frequency = 2, pos tag = 1). This is typically how search engines with data can do it. Note that you don't need search statistics or users for this, just a willingness to label the top-5 completions for a few hundred queries. Building a formal evaluation (that can be run in an automated manner) would probably help when trying to improve the system in the future. Any time you observe a bad completion, you could add it to the database and do a few labels -- over time, a supervised approach would get better.

How to correct OCR segmentation errors using bounding rectangles?

I'm using tesseract for OCR and have noticed, that sometimes segmentation errors occur and characters, that "obviously" belong together are split into separted strings.
Based on a list of characters and their bounding boxes found in one text line and the prilimanary OCR result suggesting, which of these characters belong to one word, which algorithms can I apply to correct segmentation errors or verify the result?
So this this the data available:
List<Word> words;
for(Word word : words){
for(Char c : word.getChars()){
char ch = c.getValue();
Rectangle rect = c.getRect();
}
}
For OCR post-correction that takes into account the characters and words, but admittedly not the bounding boxes, one common practice is
to use a dictionary of valid words, as comprehensive as possible
to check the words that results from the OCR algorithm against that dictionary
if a word cannot be found as an exact match in the dictionary, to try and find a similar one
To make this possible, you need to prepare the dictionary implementation so it enables a search for similar strings, also known as approximate string matching or fuzzy string matching.
The two main approaches for this that I am aware of are
Levenshtein atomata, as described by Schulz et al (DOI: 10.1007/s10032-002-0082-8)
Metric trees, such as the BK tree described by Baeza-Yates and Navarro (DOI: 10.1109/SPIRE.1998.712978)
These approaches, as well as general approximate string matching approaches (such as search tries, q-grams matching and n-gram matching) all inherently use some kind of edit distance measure, more or less similar to Levenshtein distance. After analysing the specific OCR errors you are dealing with, you might want to adjust the edit distance algorithm and the other resources you are using to your specific needs. This may involve things like:
Assume a lower substitution distance between characters that your OCR program confuses frequently, or characters that look particularly similar when rendered with the font or style you are dealing with
Take possible segmentation errors into account by putting frequently occurring pairs of words in the dictionary (in addition to single words)
Ensure that your dictionary contains as many named entities and other domain specific (or corpus specific) elements
Further more, you can try to use a grammar, and/or a statistical language model, such as a Hidden Markov Model or Conditional Random Field model -- similar to the models used by POS taggers -- to make word corrections in context.

Machine Learning: Good way to represent word features

Not quite sure if this is the right place or not..
But here is my question.
So for features which are numeric in nature, it is quite natural to represent them, plot them, etc., but what about words?
How do you deal with data where you have words as features? So let's say I have a dataset with following features:
InventoryVal, Number of Units, Avg Price, Category of Event and so on..
InventoryVal is a number
Number of Units is a number
Avg Price is a number
Category of Event is a word that is assigned by humans.
Event if I replace category (example) "books" by an id...... (say 1) but then that is also something which I have assigned and that's not something intrinsic of data.
What is a good metric to represent that a product belongs to category "art" without artificially assigning anything?
Eghh.. too vague or loosely worded question?/
So as you might have guessed there are entire ML libraries directed to this problem, but if you just want to get started, the simplest (and perhaps most common) is word frequency. In other words, you represent each word as a feature whose value is a function of the number of times that words occurs in each document.
But the most common words (a, and, the, this, etc.) are the most commonly occurring (in ordinary text documents (e.g, email messages) but are hardly the most important, so it is common to express a word feature as the inverse of it's frequency.
So again, this is the simplest methodology (bag of words is how it's usually referred to); more sophisticated analysis (which are not always required) pre-process the individual words to categorize them into e.g., parts-of-speech analysis.
If you like python, i recommend NLTK (Natural Language Tool Kit) is a mature and well-documented python library. There are quite a few "getting started" tutorials, but perhaps begin with ones created by the NLTK contributors and which are referenced on the NLTK homepage; these tutorials usually rely on corpus (data set) included in the base NLTK install.
If you are using an existing machine learning package, or a packaged machine learning algorithm, there may be a way to tell it that a particular field holds e.g. integers which are to be treated as identifiers, in which only comparisons for equality and inequality make sense. If not, if there are only a small number of distinct categories, it might make sense to replace a category field with 10 values with 10 binary fields, holding 1 if the object is in that particular category, or 0 if not (or 9 fields, with the object in the 10th category if all of them are 0).

Classifying english words into rare and common

I'm trying to devise a method that will be able to classify a given number of english words into 2 sets - "rare" and "common" - the reference being to how much they are used in the language.
The number of words I would like to classify is bounded - currently at around 10,000, and include everything from articles, to proper nouns that could be borrowed from other languages (and would thus be classified as "rare"). I've done some frequency analysis from within the corpus, and I have a distribution of these words (ranging from 1 use, to tops about 100).
My intuition for such a system was to use word lists (such as the BNC word frequency corpus, wordnet, internal corpus frequency), and assign weights to its occurrence in one of them.
For instance, a word that has a mid level frequency in the corpus, (say 50), but appears in a word list W - can be regarded as common since its one of the most frequent in the entire language. My question was - whats the best way to create a weighted score for something like this? Should I go discrete or continuous? In either case, what kind of a classification system would work best for this?
Or do you recommend an alternative method?
Thanks!
EDIT:
To answer Vinko's question on the intended use of the classification -
These words are tokenized from a phrase (eg: book title) - and the intent is to figure out a strategy to generate a search query string for the phrase, searching a text corpus. The query string can support multiple parameters such as proximity, etc - so if a word is common, these params can be tweaked.
To answer Igor's question -
(1) how big is your corpus?
Currently, the list is limited to 10k tokens, but this is just a training set. It could go up to a few 100k once I start testing it on the test set.
2) do you have some kind of expected proportion of common/rare words in the corpus?
Hmm, I do not.
Assuming you have a way to evaluate the classification, you can use the "boosting" approach to machine learning. Boosting classifiers use a set of weak classifiers combined to a strong classifier.
Say, you have your corpus and K external wordlists you can use.
Pick N frequency thresholds. For example, you may have 10 thresholds: 0.1%, 0.2%, ..., 1.0%.
For your corpus and each of the external word lists, create N "experts", one expert per threshold per wordlist/corpus, total of N*(K+1) experts. Each expert is a weak classifier, with a very simple rule: if the frequency of the word is higher than its threshold, they consider the word to be "common". Each expert has a weight.
The learning process is as follows: assign the weight 1 to each expert. For each word in your corpus, make the experts vote. Sum their votes: 1 * weight(i) for "common" votes and (-1) * weight(i) for "rare" votes. If the result is positive, mark the word as common.
Now, the overall idea is to evaluate the classification and increase the weight of experts that were right and decrease the weight of the experts that were wrong. Then repeat the process again and again, until your evaluation is good enough.
The specifics of the weight adjustment depends on the way how you evaluate the classification. For example, if you don't have per-word evaluation, you may still evaluate the classification as "too many common" or "too many rare" words. In the first case, promote all the pro-"rare" experts and demote all pro-"common" experts, or vice-versa.
Your distribution is most likely a Pareto distribution (a superset of Zipf's law as mentioned above). I am shocked that the most common word is used only 100 times - this is including "a" and "the" and words like that? You must have a small corpus if that is the same.
Anyways, you will have to choose a cutoff for "rare" and "common". One potential choice is the mean expected number of appearances (see the linked wiki article above to calculate the mean). Because of the "fat tail" of the distribution, a fairly small number of words will have appearances above the mean -- these are the "common". The rest are "rare". This will have the effect that many more words are rare than common. Not sure if that is what you are going for but you can just move the cutoff up and down to get your desired distribution (say, all words with > 50% of expected value are "common").
While this is not an answer to your question, you should know that you are inventing a wheel here.
Information Retrieval experts have devised ways to weight search words according to their frequency. A very popular weight is TF-IDF, which uses a word's frequency in a document and its frequency in a corpus. TF-IDF is also explained here.
An alternative score is the Okapi BM25, which uses similar factors.
See also the Lucene Similarity documentation for how TF-IDF is implemented in a popular search library.

How do you Index Files for Fast Searches?

Nowadays, Microsoft and Google will index the files on your hard drive so that you can search their contents quickly.
What I want to know is how do they do this? Can you describe the algorithm?
The simple case is an inverted index.
The most basic algorithm is simply:
scan the file for words, creating a list of unique words
normalize and filter the words
place an entry tying that word to the file in your index
The details are where things get tricky, but the fundamentals are the same.
By "normalize and filter" the words, I mean things like converting everything to lowercase, removing common "stop words" (the, if, in, a etc.), possibly "stemming" (removing common suffixes for verbs and plurals and such).
After that, you've got a unique list of words for the file and you can build your index off of that.
There are optimizations for reducing storage, techniques for checking locality of words (is "this" near "that" in the document, for example).
But, that's the fundamental way it's done.
Here's a really basic description; for more details, you can read this textbook (free online): http://informationretrieval.org/¹
1). For all files, create an index. The index consists of all unique words that occur in your dataset (called a "corpus"). With each word, a list of document ids is associated; each document id refers to a document that contains the word.
Variations: sometimes when you generate the index you want to ignore stop words ("a", "the", etc). You have to be careful, though ("to be or not to be" is a real query composed of stopwords).
Sometimes you also stem the words. This has more impact on search quality in non-English languages that use suffixes and prefixes to a greater extent.
2) When a user enters a query, look up the corresponding lists, and merge them. If it's a strict boolean query, the process is pretty straightforward -- for AND, a docid has to occur in all the word lists, for OR, in at least one wordlist, etc.
3) If you want to rank your results, there are a number of ways to do that, but the basic idea is to use the frequency with which a word occurs in a document, as compared to the frequency you expect it to occur in any document in the corpus, as a signal that the document is more or less relevant. See textbook.
4) You can also store word positions to infer phrases, etc.
Most of that is irrelevant for desktop search, as you are more interested in recall (all documents that include the term) than ranking.
¹ previously on http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html, accessible via wayback machine
You could always look into something like Apache Lucene.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Resources