I am classifying text using the bag of words model. I read in 800 text files, each containing a sentence.
The sentences are then represented like this:
[{"OneWord":True,"AnotherWord":True,"AndSoOn":True},{"FirstWordNewSentence":True,"AnSoOn":True},...]
How many dimensions does my data have?
Is it the number of entries in the largest vector? Or is it the number of unique words? Or something else?
For each document, the bag of words model gives a set of sparse features. For example (using the first sentence from your example):
OneWord
AnotherWord
AndSoOn
The above are the three active features for the document. The representation is sparse because we never list the inactive features explicitly, even though we have a very large vocabulary (all possible unique words that you consider as features). In other words, we did not say:
OneWord
AnotherWord
AndSoOn
FirstWordNewSentence: false
We only include those words that are "true".
How many dimensions does my data have?
Is it the number of entries in the largest vector? Or is it the number of unique words? Or something else?
If you stick with the sparse feature representation, you might want to estimate the average number of active features per document instead. That number is 2.5 in your example ((3+2)/2 = 2.5).
If you use a dense representation (e.g., one-hot encoding, though that is not a good idea if the vocabulary is large), the input dimension is equal to your vocabulary size.
If you use a 100-dimensional word embedding and combine all the words' embeddings (for example, by averaging them) to form a single input vector representing the document, then your input dimension is 100. In this case, you convert your sparse features into dense features via the embedding.
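To make the dense case concrete, here is a minimal sketch using scikit-learn's DictVectorizer (an assumed tool choice, not something mentioned in the question) that turns the sparse dicts into a dense matrix whose width is the vocabulary size:

from sklearn.feature_extraction import DictVectorizer

# The two example "sentences" as sparse feature dicts.
docs = [
    {"OneWord": True, "AnotherWord": True, "AndSoOn": True},
    {"FirstWordNewSentence": True, "AndSoOn": True},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(docs)           # booleans become 0/1 entries

print(X.shape)                        # (2, 4): 2 documents, vocabulary of 4 unique words
print(vec.get_feature_names_out())    # the 4 dimensions (recent scikit-learn versions)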
Related
I have around 1000 pairs of sentences. Each pair consists of two sentences, one with a high CTR and one with a low CTR. I want to create a mechanism to automatically produce sentences optimized for high CTR. When iterating over the pairs, I can get a vector for each sentence (using spaCy). I take the difference of the vectors (Sent1.vector - Sent2.vector) and then average over all the pairs using numpy's mean. When I have the "difference vector" in hand, I want to add it to any given text and get a new sentence. Any ideas how to achieve this? Gensim's most_similar only works on single words... Thanks
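For reference, a rough sketch of the averaging step described above, assuming a spaCy model that ships word vectors (e.g. en_core_web_md) and using placeholder sentence pairs:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")    # any model with word vectors

pairs = [                             # placeholders; substitute the real 1000 pairs
    ("Buy now and save 50%", "You can purchase at a discount"),
    ("Free shipping ends tonight", "Shipping costs nothing"),
]

diffs = [nlp(high).vector - nlp(low).vector for high, low in pairs]
direction = np.mean(diffs, axis=0)    # the averaged "difference vector"

# Adding `direction` to a new sentence's vector gives a point in vector space;
# decoding that point back into a sentence is the open part of the question.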
I am training multiple word2vec models with Gensim. Each word2vec model has the same parameters and dimensionality, but is trained on slightly different data. Then I want to compare how the change in data affected the vector representation of some words.
But every time I train a model, the vector representation of the same word is wildly different. Its similarities to other words remain similar, but the whole vector space seems to be rotated.
Is there any way I can rotate both word2vec representations in such a way that the same words occupy the same position in vector space, or at least are as close as possible?
Thanks in advance.
That the locations of words vary between runs is to be expected. There's no one 'right' place for words, just mutual arrangements that are good at the training task (predicting words from other nearby words). The algorithm involves random initialization, random choices during training, and (usually) multithreaded operation, which can change the effective ordering of training examples, and thus the final results, even if you were to try to eliminate the randomness by relying on a deterministically seeded pseudorandom number generator.
There's a class called TranslationMatrix in gensim that implements the learn-a-projection-between-two-spaces method, as used for machine-translation between natural languages in one of the early word2vec papers. It requires you to have some words that you specify should have equivalent vectors – an anchor/reference set – then lets other words find their positions in relation to those. There's a demo of its use in gensim's documentation notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
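For intuition only, here is a rough numpy sketch of the same idea (an orthogonal Procrustes fit on shared anchor words), assuming gensim KeyedVectors such as model.wv; it is an illustration of the technique, not the TranslationMatrix API itself:

import numpy as np

def align_to(source_kv, target_kv, anchor_words):
    # Stack the vectors of anchor words present in both models.
    anchors = [w for w in anchor_words if w in source_kv and w in target_kv]
    S = np.stack([source_kv[w] for w in anchors])
    T = np.stack([target_kv[w] for w in anchors])
    # Orthogonal Procrustes: the rotation R minimizing ||S @ R - T||.
    U, _, Vt = np.linalg.svd(S.T @ T)
    R = U @ Vt
    # A rotated copy of every source vector, now comparable to the target space.
    return source_kv.vectors @ R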
But, there are some other techniques you could also consider:
transform & concatenate the training corpuses instead, to both retain some words that are the same across all corpuses (such as very frequent words), but make other words of interest different per segment. For example, you might leave words like "hot" and "cold" unchanged, but replace words like "tamale" or "skiing" with subcorpus-specific versions, like "tamale(A)", "tamale(B)", "skiing(A)", "skiing(B)". Shuffle all data together for training in a single session, then check the distances/directions between "tamale(A)" and "tamale(B)" - since they were each only trained by their respective subsets of the data. (It's still important to have many 'anchor' words, shared between different sets, to force a correlation on those words, and thus a shared influence/meaning for the varying-words.)
Create a model for all the data, with a single vector per word. Save that model aside. Then re-load it, and try re-training it with just subsets of the whole data. Check how much words move when trained on just the segments. (It might again help comparability to hold certain prominent anchor words constant. There's an experimental property in model.trainables, with a name ending in _lockf, that lets you scale the updates to each word. If you set its values to 0.0, instead of the default 1.0, for certain word slots, those words can't be further updated. So after re-loading the model, you could 'freeze' your reference words by setting their _lockf values to 0.0, so that only the other words get updated by the secondary training, and they're still bound to have coordinates that make sense with regard to the unmoving anchor words. Read the source code to better understand how _lockf works; a rough sketch follows below.)
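A rough sketch of that freezing trick, assuming the gensim 3.x layout where the lock factors live at model.trainables.vectors_lockf (attribute locations changed in later gensim versions, so check the source for yours):

from gensim.models import Word2Vec

model = Word2Vec.load("all_data.model")        # the model trained on everything

anchor_words = ["hot", "cold"]                 # illustrative anchor words
for w in anchor_words:
    if w in model.wv.vocab:
        # 0.0 scales this word's updates to zero, i.e. freezes it in place.
        model.trainables.vectors_lockf[model.wv.vocab[w].index] = 0.0

# Continue training on one subset only; frozen anchors keep their coordinates.
subset = [["tamale", "is", "hot"], ["skiing", "is", "cold"]]   # placeholder data
model.train(subset, total_examples=len(subset), epochs=5)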
I'm working on a problem that involves reconciling data that represents estimates of the same system under two different classification hierarchies. I want to enforce the requirement that equivalent classes or groups of classes have the same sum.
For example, say Classification A divides industries into: Agriculture (sheep/cattle), Agriculture (non-sheep/cattle), Mining, Manufacturing (textiles), Manufacturing (non-textiles), ...
Meanwhile, Classification B has a different breakdown: Agriculture, Mining (iron ore), Mining (non-iron-ore), Manufacturing (chemical), Manufacturing (non-chemical), ...
In this case, any total for A_Agric_SheepCattle + A_Agric_NonSheepCattle should match the equivalent total for B_Agric; A_Mining should match B_MiningIronOre + B_Mining_NonIronOre; and A_MFG_Textiles+A_MFG_NonTextiles should match B_MFG_Chemical+B_MFG_NonChemical.
For bonus complication, one category may be involved in multiple equivalencies, e.g. B_Mining_IronOre might be involved in an equivalency with both A_Mining and A_Mining_Metallic.
I will be working with multi-dimensional tables, with this sort of concordance applied to more than one dimension - e.g. I might be compiling data on Industry x Product, so each equivalency will be used in multiple constraints; hence I need an efficient way to define them once and invoke repeatedly, instead of just setting a direct constraint "A_Agric_SheepCattle + A_Agric_NonSheepCattle = B_Agric".
The most natural way to represent this sort of concordance would seem to be as a list of pairs of sets. The catch is that the set sizes will vary - sometimes we have a 1:1 equivalence, sometimes it's "these 5 categories equate to those 7 categories", etc.
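For what it's worth, a small plain-Python sketch of that representation (the category names beyond those in the example above are made up), showing that the paired sets can have any sizes and be checked repeatedly without padding:

# Each entry pairs a set of Classification A categories with the equivalent
# set of Classification B categories; the two sets can be any size.
concordance = [
    ({"A_Agric_SheepCattle", "A_Agric_NonSheepCattle"}, {"B_Agric"}),
    ({"A_Mining"}, {"B_Mining_IronOre", "B_Mining_NonIronOre"}),
    ({"A_MFG_Textiles", "A_MFG_NonTextiles"}, {"B_MFG_Chemical", "B_MFG_NonChemical"}),
]

def violated(totals_a, totals_b, tol=1e-6):
    """Yield the equivalencies whose sums disagree, for one slice of the table.

    totals_a and totals_b map category name -> estimated total."""
    for left, right in concordance:
        lhs = sum(totals_a[c] for c in left)
        rhs = sum(totals_b[c] for c in right)
        if abs(lhs - rhs) > tol:
            yield left, right, lhs, rhs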
I found this related question, which offers two answers for dealing with variable-sized sets. One is to define all set members in a single ordered set with indices, then define the starting index of each set within that. However, this seems unwieldy for my problem; both classifications are likely to be long, so I'd need to hop between two long lists of industries and two long lists of indices to see a single equivalency. This seems like it would be a nuisance to check, and hard to modify (since any change to the membership of one of the early sets changes the index numbers for all the following sets).
The other is to define pairs of long fixed-length sets, and then pad each set to the required length with null members.
This would be a much better option for my purposes, since it lets me eyeball a single line and see the equivalence it represents. But it would require a LOT of padding; most of the equivalence groups will be small, but a few might be quite large, and everything has to be padded to the largest expected length.
Is there a better approach?
I am building a classifier to categorize documents.
So the first step is to represent each document as a "feature vector" for training purposes.
After some research, I found that I can use either the Bag of Words approach or the N-gram approach to represent a document as a vector.
The text in each document (scanned PDFs and images) is retrieved using OCR, so some words contain errors. I also have no prior knowledge of the language used in these documents (so I can't use stemming).
So, as far as I understand, I have to use the N-gram approach. Or are there other approaches to representing a document?
I would also appreciate it if someone could link me to an N-gram guide so I can get a clearer picture of how it works.
Thanks in advance.
1. Use language detection to get the document's language (my favorite tool is LanguageIdentifier from the Tika project, but many others are available).
2. Use spell correction (see this question for some details).
3. Stem words (if you work in a Java environment, Lucene is your choice).
4. Collect all N-grams (see below).
5. Make instances for classification by extracting N-grams from the particular documents.
6. Build the classifier.
N-gram models
N-grams are just sequences of N items. In classification by topic you normally use N-grams of words or their roots (though there are models based on N-grams of characters). The most popular N-grams are unigrams (single words), bigrams (2 consecutive words) and trigrams (3 consecutive words). So, from the sentence
Hello, my name is Frank
you should get the following unigrams:
[hello, my, name, is, frank] (or [hello, I, name, be, frank], if you use roots)
and the following bigrams:
[hello_my, my_name, name_is, is_frank]
and so on.
In the end, your feature vector should have as many positions (dimensions) as there are distinct words in all your texts, plus 1 for unknown words. Every position in an instance vector should somehow reflect the number of corresponding words in the instance text. This may be the number of occurrences, a binary feature (1 if the word occurs, 0 otherwise), a normalized feature, or tf-idf (very popular in classification by topic).
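For example, a minimal sketch with scikit-learn (an assumed tool choice) that builds count and tf-idf feature vectors over word unigrams and bigrams:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Hello, my name is Frank", "My name is not Frank"]

# Counts over word unigrams and bigrams.
counts = CountVectorizer(ngram_range=(1, 2))
X_counts = counts.fit_transform(docs)        # documents x n-gram vocabulary
print(counts.get_feature_names_out())        # e.g. 'frank', 'hello my', 'name is', ...
print(X_counts.toarray())

# The same vocabulary, but tf-idf weighted instead of raw counts.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(docs)

(With OCR noise, character n-grams, e.g. CountVectorizer(analyzer="char_wb", ngram_range=(3, 5)), can be more robust than word n-grams.)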
The classification process itself is the same as for any other domain.
I've seen some questions on deciding the best OCR result given output from different engines, and the answer is typically "choose the best engine".
I want, however, to capture several frames of text images, with possible temporary occlusions or temporary failures.
I'm using tesseract-ocr with python-tesseract.
Considering the OCR outputs of the last N frames, I want to decide what is the best result (line by line, for simplicity).
For example, for N=3, we could use median filtering:
ABXD
XBCX
AXCD
When 2 out of 3 characters agree, the majority wins, so the result would be ABCD.
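A minimal sketch of that per-character vote, assuming all readings of a line have the same length:

from collections import Counter

def majority_vote(readings):
    # Vote column by column across the N readings of the same line.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*readings))

print(majority_vote(["ABXD", "XBCX", "AXCD"]))   # -> "ABCD"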
However, that's not so easy with different string sizes. If I expect a given size M (when scanning a price table, the rows are typically XX.XX), I can always penalize strings longer than M.
If we were talking about numbers, median filtering would work quite well (as in simple background subtraction in computer vision), or some least-mean-squares adaptive filtering.
There's also the problem of similar characters: l and 1 can be very similar, depending on the font.
I was also thinking of using string distances between each pair of strings. For example, I could choose the string with the smallest sum of distances to the others.
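A rough sketch of that idea using only the Python standard library (difflib's similarity ratio as a stand-in for a proper edit distance):

import difflib

def most_central(candidates):
    """Pick the reading whose total distance to all the other readings is smallest."""
    def total_distance(i):
        return sum(1 - difflib.SequenceMatcher(None, candidates[i], other).ratio()
                   for j, other in enumerate(candidates) if j != i)
    best = min(range(len(candidates)), key=total_distance)
    return candidates[best]

print(most_central(["ABXD", "XBCX", "AXCD"]))    # returns one of the three readings

Note that this only ever returns one of the existing readings; combining it with the per-character vote above gets closer to a true consensus string.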
Has anyone addressed this kind of problem before? Is there any known algorithm for this kind of problem that I should know?
This problem is called multiple sequence alignment, and you can read about it here.