How to obtain word embedding vectors? - word-embedding

I read the paper "Integrating and Evaluating Neural Word Embeddings
in Information Retrieval".
While trying to understand the source code, I opened the file named:
vectors_ap8889_skipgram_s200_w20_neg20_hs0_sam1e-4_iter5
and found word vector representations like this:
downtown "-0.465147 -0.049099 -0.023432 0.058986 -0.085395 -0.027324 -0.050315 ................................................"
Please explain what these values mean, which term in the corpus they refer to, and how I can obtain them.

The numbers don't have an intrinsic meaning. They are just an n-dimensional embedding of the given words.
If the embedding is trained correctly, you should see similar words having embeddings close together. For example, "good" should be closer to "awesome" than to "island".
The common way to use it is to convert the words into their embedding space and use that as the input to some machine learning problem. The advantage is that the embedding is trained on a lot more data than you have for your problem, so it offers a shortcut to training your model.
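As a rough illustration, here is a minimal Python sketch of loading such a vectors file and comparing two words by cosine similarity. It assumes the file is in the plain word2vec text format (one token per line followed by its floating-point components); the file name and the comparison words are just placeholders:

import math

def load_vectors(path):
    # One line per word: the token, then its floating-point components.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 3:
                continue  # skips the optional "vocab_size dimension" header line
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Placeholder usage: both the path and the words depend on your own file.
vecs = load_vectors("vectors_ap8889_skipgram_s200_w20_neg20_hs0_sam1e-4_iter5")
print(cosine(vecs["downtown"], vecs["city"]))

Words that the training corpus uses in similar contexts should come out with a cosine similarity close to 1.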

Related

What algorithms can group characters into words?

I have some text generated by some lousy OCR software.
The output contains a mixture of words and space-separated characters, which should have been grouped into words. For example,
Expr e s s i o n Syntax
S u m m a r y o f T e r minology
should have been
Expression Syntax
Summary of Terminology
What algorithms can group characters into words?
If I program in Python, C#, Java, C or C++, what libraries provide the implementation of the algorithms?
Thanks.
Minimal approach:
In your input, remove the space before any single letter words. Mark the final words created as part of this somehow (prefix them with a symbol not in the input, for example).
Get a dictionary of English words, sorted longest to shortest.
For each marked word in your input, find the longest match and break that off as a word. Repeat on the characters left over in the original "word" until there's nothing left over. (In the case where there's no match just leave it alone.)
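A rough Python sketch of this minimal approach, assuming an English word list is available as a set of lowercase strings (the tiny dictionary below is purely illustrative):

import re

def rejoin_single_letters(text):
    # Remove the space before any single-letter token, so "Expr e s s i o n"
    # becomes "Expression" (note this also glues real one-letter words like "a").
    return re.sub(r" (\w)\b", r"\1", text)

def longest_match_split(token, dictionary):
    # Break the longest dictionary word off the front, repeat on the remainder;
    # if nothing matches at all, leave the rest alone.
    words = []
    while token:
        for length in range(len(token), 0, -1):
            if token[:length].lower() in dictionary:
                words.append(token[:length])
                token = token[length:]
                break
        else:
            words.append(token)
            break
    return words

# Toy dictionary and input -- in practice use a full English word list.
dictionary = {"expression", "syntax", "summary", "of", "terminology"}
line = rejoin_single_letters("Expr e s s i o n Syntax")
print([w for t in line.split() for w in longest_match_split(t, dictionary)])
# -> ['Expression', 'Syntax']

Trying prefixes from longest to shortest plays the same role as sorting the dictionary longest-first.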
More sophisticated, overkill approach:
The problem of splitting words without spaces is a real-world problem in languages commonly written without spaces, such as Chinese and Japanese. I'm familiar with Japanese so I'll mainly speak with reference to that.
Typical approaches use a dictionary and a sequence model. The model is trained to learn transition properties between labels - part of speech tagging, combined with the dictionary, is used to figure out the relative likelihood of different potential places to split words. Then the most likely sequence of splits for a whole sentence is solved for using (for example) the Viterbi algorithm.
Creating a system like this is almost certainly overkill if you're just cleaning OCR data, but if you're interested it may be worth looking into.
A sample case where the more sophisticated approach will work and the simple one won't:
input: Playforthefunofit
simple output: Play forth efunofit (forth is longer than for)
sophisticated output: Play for the fun of it (forth efunofit is a low-frequency - that is, unnatural - transition, while for the is not)
You can work around the issue with the simple approach to some extent by adding common short-word sequences to your dictionary as units. For example, add forthe as a dictionary word, and split it in a post processing step.
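To make the frequency idea concrete, here is a rough Python sketch. It is not the full POS-informed Viterbi setup described above, just a dynamic program over unigram word frequencies (the counts below are made up), but it is enough to split the example correctly:

import math

def segment(text, freq):
    # best[i] holds (cost, split_point) for the prefix text[:i]; the cost of a
    # word is its negative log relative frequency, so lower totals are better.
    total = sum(freq.values())
    def cost(word):
        return -math.log(freq[word] / total) if word in freq else float("inf")
    best = [(0.0, 0)] + [(float("inf"), 0)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - 20), i):   # consider candidate words up to 20 chars
            c = best[j][0] + cost(text[j:i].lower())
            if c < best[i][0]:
                best[i] = (c, j)
    # Walk the recorded split points backwards to recover the words.
    words, i = [], len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

# Made-up counts, just enough to prefer "for the" over "forth e".
freq = {"play": 50, "for": 500, "the": 1000, "fun": 40, "of": 800, "it": 600, "forth": 5}
print(segment("Playforthefunofit", freq))   # -> ['Play', 'for', 'the', 'fun', 'of', 'it']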
Hope that helps - good luck!

Does a word checking algorithm exist?

I've been wondering whether this has been created already, but imagine a function that can validate a string and determine whether it's a word or not, e.g.
print(validateWord("Hello")) --> true
print(validateWord("Haloe")) --> true (may not be a real word but follows the standards of placements of vowels and such)
print(validateWord("sewxdw")) --> false
I'm not asking for code; I would just like to know whether this already exists, and a link to a wiki page on the algorithm would be nice if it does.
What you want is a hidden Markov model, trained on the words in a corpus of English (or whatever language you are interested in). You can then score putative words by whether the model likes them or not. It will only outright disallow combinations that never occur, like "jx", but it should give a low score to unlikely candidates.
You might have better luck trying to break the text up into phoneme symbols (th, ae, qu, ph, etc.) first rather than writing a model that uses raw letters.
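For a concrete feel of the scoring idea, here is a rough Python sketch using a plain character-bigram Markov chain rather than a full hidden Markov model; the tiny training list below is a toy stand-in for a real dictionary, which would give a much clearer separation between word-like and non-word-like strings:

import math
from collections import defaultdict

def train_char_bigrams(words):
    # Count character-to-character transitions, with "^" and "$" marking word boundaries.
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        chars = ["^"] + list(w.lower()) + ["$"]
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    return counts

def score(word, counts, alpha=0.01):
    # Average log probability per transition, with add-alpha smoothing so unseen
    # transitions are merely penalised rather than forbidden outright.
    chars = ["^"] + list(word.lower()) + ["$"]
    logp = 0.0
    for a, b in zip(chars, chars[1:]):
        total = sum(counts[a].values())
        logp += math.log((counts[a][b] + alpha) / (total + alpha * 27))  # 26 letters + "$"
    return logp / (len(chars) - 1)

# Toy training set; train on a real word list to get a usable threshold.
counts = train_char_bigrams(["hello", "halo", "yellow", "low", "word", "world", "sew"])
for candidate in ["Hello", "Haloe", "sewxdw"]:
    print(candidate, round(score(candidate, counts), 2))

Higher (less negative) scores mean more word-like; you would pick a threshold by scoring known words and known junk.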

Why does MaxentTagger tag numbers as NN sometimes?

I am trying to tag an HTML page full of space-separated numbers like "5320412185 5320412184 5320412189..." to observe how the tagger behaves with numbers. I'm using english-left3words-distsim.tagger in the constructor. I'm observing on the console that most of the numbers are tagged as CD, but at times there are also numbers getting tagged as NN. I searched the FAQ page on nlp.stanford.edu but couldn't find this there. Can anyone help me understand this?
I don't know if I need to mention this: I'm feeding each number separately to the tagger by splitting the huge input (1,045,000 numbers!) on the space delimiter.
From Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision)
Sometimes, it is unclear whether one is a cardinal number or a noun. In general, it should be tagged as a cardinal number (CD) even when its sense is not clearly that of a numeral.
EXAMPLE: one/CD of the best reasons
But if it could be pluralized or modified by an adjective in a particular context, it is a common noun (NN).
EXAMPLE: the only (good) one/NN of its kind
(cf. the only (good) ones/NNS of their kind)
In the collocation another one, one should also be tagged as a common noun (NN).
Hyphenated fractions one-half, three-fourths, seven-eighths, one-and-a-half, seven-and-three-eighths should be tagged as adjectives (JJ) when they are prenominal modifiers, but as adverbs (RB) if they could be replaced by double or twice.
For further reading: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1603&context=cis_reports

GoLang PoS Tagger script taking longer than it should with no output in terminal

This script compiles without errors on play.golang.org: http://play.golang.org/p/Hlr-IAc_1f
But when I run it on my machine, it takes much longer than I expect, with nothing happening in the terminal.
What I am trying to build is a part-of-speech tagger.
I think the longest part is loading lexicon.txt into a map and then comparing each word with every word there to see if it has already been tagged in the lexicon. The lexicon only contains verbs, but doesn't every word need to be checked to see if it is a verb?
The larger problem is that I don't know how to determine if a word is a verb with an easy heuristic, as with adverbs, adjectives, etc.
(Quoting):
I don't know how to determine if a word is a verb with an easy heuristic like adverbs, adjectives, etc.
I can't speak to any issues in your Go implementation, but I'll address the larger problem of POS tagging in general. It sounds like you're attempting to build a rule-based unigram tagger. To elaborate a bit on those terms:
"unigram" means you're considering each word in the sentence separately. Note that a unigram tagger is inherently limited, in that it cannot disambiguate words which can take on multiple POS tags. E.g., should you tag 'fish' as a noun or a verb? Is 'last' a verb or an adverb?
"rule-based" means exactly what it sounds like: a set of rules to determine the tag for each word. Rule-based tagging is limited in a different way - it requires considerable development effort to assemble a ruleset that will handle a reasonable portion of the ambiguity in common language. This effort might be appropriate if you're working in a language for which we don't have good training resources, but in most common languages, we now have enough tagged text to train high-accuracy tagging models.
State-of-the-art for POS tagging is above 97% accuracy on well-formed newswire text (accuracy on less formal genres is naturally lower). A rule-based tagger will probably perform considerably worse (you'll have to determine the accuracy level needed to meet your requirements). If you want to continue down the rule-based path, I'd recommend reading this tutorial. The code is based on Haskell, but it will help you learn the concepts and issues in rule-based tagging.
That said, I'd strongly recommend you look at other tagging methods. I mentioned the weaknesses of unigram tagging. Related approaches would be 'bigram', meaning that we consider the previous word when tagging word n, 'trigram' (usually the previous 2 words, or the previous word, the current word, and the following word); more generally, 'n-gram' refers to considering a sequence of n words (often, a sliding window around the word we're currently tagging). That context can help us disambiguate 'fish', 'last', 'flies', etc.
E.g., in
We fish
we probably want to tag fish as a verb, whereas in
ate fish
it's certainly a noun.
The NLTK tutorial might be a good reference here. A solid n-gram tagger should get you above 90% accuracy; likely above 95% (again, on newswire text).
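For example, a minimal NLTK sketch along those lines (assuming NLTK is installed and the bundled Penn Treebank sample has been fetched with nltk.download('treebank'); on such a small sample the fish/verb case is not guaranteed to come out right):

import nltk
from nltk.corpus import treebank

# Train a unigram tagger and a bigram tagger that backs off to it for contexts
# it has never seen, using the small Penn Treebank sample bundled with NLTK.
tagged_sents = treebank.tagged_sents()
train, test = tagged_sents[:3000], tagged_sents[3000:]

unigram = nltk.UnigramTagger(train)
bigram = nltk.BigramTagger(train, backoff=unigram)

print(bigram.accuracy(test))   # use .evaluate(test) on older NLTK releases
print(bigram.tag(["We", "fish"]))
print(bigram.tag(["They", "ate", "fish"]))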
More sophisticated methods (known as 'structured inference') consider the entire tag sequence as a whole. That is, instead of trying to predict the most probable tag for each word separately, they attempt to predict the most probable sequence of tags for the entire input sequence. Structured inference is of course more difficult to implement and train, but will usually improve accuracy vs. n-gram approaches. If you want to read up on this area, I suggest Sutton and McCallum's excellent introduction.
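If you want to experiment with structured inference without building it from scratch, a linear-chain CRF via the third-party sklearn-crfsuite package is one option; the sketch below (with made-up toy data and an arbitrary feature set) just shows the shape of the API, not a real training setup:

import sklearn_crfsuite

def features(sent, i):
    # Deliberately small feature set: the word, a suffix, and its neighbours.
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "suffix3": word[-3:],
        "prev": sent[i - 1].lower() if i > 0 else "<s>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "</s>",
    }

# Toy training data: each sentence is a token list paired with its tag sequence.
train_sents = [(["We", "fish"], ["PRP", "VBP"]),
               (["They", "ate", "fish"], ["PRP", "VBD", "NN"])]
X = [[features(words, i) for i in range(len(words))] for words, _ in train_sents]
y = [tags for _, tags in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict([[features(["We", "fish"], i) for i in range(2)]]))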
You've got a large array argument in this function:
func stringInArray(a string, list [214]string) bool {
    for _, b := range list {
        if b == a {
            return true
        }
    }
    return false
}
The array of stopwords gets copied each time you call this function.
In Go, you should generally use slices rather than arrays. Change this parameter to list []string and define stopWords as a slice rather than an array:
stopWords := []string{
    "and", "or", ...
}
Probably an even better approach would be to build a map of the stopWords:
isStopWord := map[string]bool{}
for _, sw := range stopWords {
    isStopWord[sw] = true
}
and then you can check if a word is a stopword quickly:
if isStopWord[word] { ... }

How to find "equivalent" texts?

I want to find (not generate) two text strings such that, after removing all non-letters and uppercasing, one string can be translated to the other by a simple substitution.
The motivation for this comes from a project I know of that is testing methods for attacking ciphers via probability distributions. I'd like to find a large, coherent plain text that, once encrypted with a simple substitution cipher, can be decrypted to something else that is also coherent.
This ends up as two parts: find the longest such strings in a corpus, and get that corpus.
The first part seems to me to be amenable to some sort of attack with a B-tree keyed off the string after a substitution that makes the sequence of first occurrences sequential.
HELLOWORLDTHISISIT
1233454637819a9a98
A little optimization is possible based on knowing the maximum value and length of the string at each depth of the tree, and the rest is just coding.
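If I understand the canonical form correctly, a small Python sketch of that key function (map each letter to a symbol in order of first appearance, so any two strings related by a substitution collapse to the same key) would be:

def canonical_pattern(text):
    # Assign each distinct letter the next unused symbol, in order of first
    # appearance; substitution-equivalent strings then share the same pattern.
    symbols = "123456789abcdefghijklmnopq"   # 26 symbols, enough for the alphabet
    mapping, out = {}, []
    for ch in text:
        if ch not in mapping:
            mapping[ch] = symbols[len(mapping)]
        out.append(mapping[ch])
    return "".join(out)

print(canonical_pattern("HELLOWORLDTHISISIT"))       # 1233454637819a9a98
print(canonical_pattern("HELLOWORLDTHISISIT") ==
      canonical_pattern("URYYBJBEYQGUVFVFVG"))       # the ROT13 version -> True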
The other part would be quite a bit more involved: how to gather a large corpus of text to search? Some kind of internet spider would seem to be the ideal approach, as it would have access to the largest amount of text, but how do you strip it down to just the text?
The question is: any ideas on how to do this better?
Edit: the cipher that was being used is an insanely basic 26 letter substitution cipher.
p.s. this is more a thought experiment than a probable real project for me.
There are 26! different substitution ciphers. That works out to a bit over 88 bits of choice:
>>> import math
>>> math.log(math.factorial(26), 2)
88.381953327016262
The entropy of English text is something like 2 bits per character at least, so those 88 bits of key are "used up" after roughly 88 / 2 ≈ 44 characters. It seems to me you can't reasonably expect to find passages of more than 45-50 characters that are accidentally equivalent under substitution.
For the large corpus, there's Project Gutenberg and Wikipedia, for a start. You can download a dump of the English Wikipedia as XML from their website.
I think you're asking a bit much in wanting a substitution that is also "coherent". Figuring out what text is coherent is an AI problem in itself. Also, the longer your text is, the more complicated it will be to create a "coherent" result, quickly approaching a point where you need a "key" as long as the text you are encrypting, which defeats the purpose of encrypting it at all.
