What exactly is an n-gram? - sentiment-analysis

I found this previous question on SO: N-grams: Explanation + 2 applications. The OP gave this example and asked if it was correct:
Sentence: "I live in NY."
word level bigrams (2 for n): "# I", "I live", "live in", "in NY", "NY #"
character level bigrams (2 for n): "#I", "I#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#N", "NY", "Y#"
When you have this array of n-gram-parts, you drop the duplicate ones and add a counter for each part giving the frequency:
word level bigrams: [1, 1, 1, 1, 1]
character level bigrams: [2, 1, 1, ...]
Someone in the answer section confirmed this was correct, but unfortunately I'm a bit lost beyond that as I didn't fully understand everything else that was said! I'm using LingPipe and following a tutorial which stated I should choose a value between 7 and 12 - but without stating why.
What is a good n-gram value and how should I take it into account when using a tool like LingPipe?
Edit: This was the tutorial: http://cavajohn.blogspot.co.uk/2013/05/how-to-sentiment-analysis-of-tweets.html

Usually a picture is worth a thousand words.
Source: http://recognize-speech.com/language-model/n-gram-model/comparison

N-grams are simply all combinations of adjacent words or letters of length n that you can find in your source text. For example, given the word fox, all 2-grams (or “bigrams”) are fo and ox. You may also count the word boundary – that would expand the list of 2-grams to #f, fo, ox, and x#, where # denotes a word boundary.
You can do the same on the word level. As an example, the hello, world! text contains the following word-level bigrams: # hello, hello world, world #.
The basic point of n-grams is that they capture the language structure from the statistical point of view, like what letter or word is likely to follow the given one. The longer the n-gram (the higher the n), the more context you have to work with. Optimum length really depends on the application – if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.
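To make this concrete, here is a small Python sketch (my own illustration, not taken from LingPipe) that extracts character-level and word-level bigrams with "#" as the boundary marker, matching the examples above:
def char_ngrams(word, n=2):
    padded = "#" + word + "#"
    return [padded[i:i+n] for i in range(len(padded) - n + 1)]

def word_ngrams(sentence, n=2):
    tokens = ["#"] + sentence.split() + ["#"]
    return [" ".join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

print(char_ngrams("fox"))           # ['#f', 'fo', 'ox', 'x#']
print(word_ngrams("hello world"))   # ['# hello', 'hello world', 'world #']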

An n-gram is an n-tuple or group of n words or characters (grams, for pieces of grammar) which follow one another. So an n of 3 for the words from your sentence would be like "# I live", "I live in", "live in NY", "in NY #". This is used to create an index of how often words follow one another. You can use this in a Markov chain to create something that resembles language. As you populate a mapping of the distributions of word groups or character groups, you can recombine them so that the output is close to natural; the longer the n-gram, the closer it will be.
Too high a number, and your output will be a word-for-word copy of the original; too low a number, and the output will be too messy.
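As a rough illustration of that idea (my own sketch, not tied to any particular library), here is a tiny word-level Markov chain in Python: it records which word follows each n-gram, then samples from those counts to generate text that resembles the source:
import random
from collections import defaultdict

def build_chain(text, n=2):
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - n):
        chain[tuple(words[i:i+n])].append(words[i+n])   # n-gram -> possible next words
    return chain

def generate(chain, n=2, length=20):
    out = list(random.choice(list(chain)))              # start from a random n-gram
    for _ in range(length):
        followers = chain.get(tuple(out[-n:]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)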

Related

How to use machine learning to count words in text

Question:
Given a piece of text like "This is a test", how can we build a machine learning model that outputs the number of word occurrences? For example, for this piece the word count is 4. After training, it should be possible to predict the word count of any text.
I know it is easy to write a program (like the pseudocode below),
data: memory.punctuation['~', '`', '!', '@', '#', '$', '%', '^', '&', '*', ...]
f: count.word(text) -> count =
f: tokenize(text) --list-->
f: count.token(list, filter) where filter(token)<not in memory.punctuation> -> count
however, in this question we are required to use a machine learning algorithm. I wonder how a machine can learn the concept of counting (currently, we know machine learning is good at classification). Any ideas and suggestions? Thanks in advance.
Failures:
We can use something like word2vec (an encoder) to build word vectors; if we consider a seq2seq approach, we can train on something like This is a test <s> 4 <e> and This is very very long sentence and the word count is greater than ten <s> 4 1 <e> (4 1 representing the number 14). However, it does not work, since the attention model is used to produce similar vectors, for example in text translation (This is a test --> 这(this) 是(is) 一个(a) 测试(test)). It is hard to find a relationship between [this ...] and 4, which is an aggregated number (i.e. the model does not converge).
We know machine learning is good at classification. If we treat "4" as a class, the number of classes is infinite; if we use a trick and predict count/text.length instead, I have not got a model that fits even the training data set (the model does not converge); for example, if we train the model on many short sentences, it will fail to predict the length of long sentences. And it may be related to an information paradox: we could encode the data in a book as 0.x and use a machine to mark a position on a rod, splitting it into two parts with lengths a and b where a/b = 0.x; but we cannot find such a machine.
What about a regression problem?
I think it would work quite well, and in the end it would output nearly whole numbers all the time.
Also, you can train a simple RNN to do the job, assuming you use a one-hot encoding and take the output from the last state.
If V_h is all zeros except at the space index (which will be 1), and V_x as well, then the network will actually sum the spaces, and if c is 1, the output at the end will be the number of words - for every length!
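Here is a small hand-weighted illustration of that argument in Python/numpy (my own sketch; the names V_x and c mirror the answer, and the fixed weights are set by hand rather than learned):
import numpy as np

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")
SPACE = VOCAB.index(" ")

V_x = np.zeros(len(VOCAB))
V_x[SPACE] = 1.0        # input-to-hidden weights: pick out the space index
w_h = 1.0               # hidden-to-hidden weight: carry the running sum forward
c = 1.0                 # hidden-to-output weight

def count_words(text):
    h = 0.0
    for ch in text.lower():
        x = np.zeros(len(VOCAB))
        if ch in VOCAB:
            x[VOCAB.index(ch)] = 1.0
        h = w_h * h + V_x @ x    # linear recurrence: h accumulates the spaces
    return c * h + 1             # spaces + 1 = number of words

print(count_words("I live in NY"))   # 4.0, regardless of sentence length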
I think we can treat it as a classification problem, with a character as the input and whether it is a word breaker as the output.
In other words, at some time point t, we output whether the input character at that time point is a word breaker (YES) or not (NO). If yes, then increase the word count. If no, then read the next character.
In modern English I don't think there are going to be very long words, so a simple RNN model should do, perhaps without any concern about vanishing gradients.
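A rough PyTorch sketch of that framing (my own toy illustration, not from the answer; the vocabulary, model size and the idea of training against per-character breaker labels with BCEWithLogitsLoss are assumptions, and the model below is untrained):
import torch
import torch.nn as nn

VOCAB = "abcdefghijklmnopqrstuvwxyz .,!?"
CHAR_TO_IDX = {c: i for i, c in enumerate(VOCAB)}

class BreakerRNN(nn.Module):
    """Per-character classifier: is this character a word breaker?"""
    def __init__(self, vocab_size, hidden=16):
        super().__init__()
        self.rnn = nn.RNN(vocab_size, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)        # one YES/NO logit per time step

    def forward(self, x):                      # x: (batch, seq_len, vocab_size) one-hot
        h, _ = self.rnn(x)
        return self.out(h).squeeze(-1)         # (batch, seq_len) logits

def one_hot(text):
    t = torch.zeros(1, len(text), len(VOCAB))
    for i, c in enumerate(text.lower()):
        t[0, i, CHAR_TO_IDX.get(c, CHAR_TO_IDX[" "])] = 1.0
    return t

model = BreakerRNN(len(VOCAB))                 # would be trained with nn.BCEWithLogitsLoss
logits = model(one_hot("this is a test"))
breakers = (torch.sigmoid(logits) > 0.5).sum().item()
word_count = breakers + 1                      # for a sentence without a trailing breaker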
Let me know what you think!
Use NLTK for counting words:
# word_tokenize may need the Punkt tokenizer data: nltk.download('punkt')
from nltk.tokenize import word_tokenize
text = "God is Great!"
word_count = len(word_tokenize(text))
print(word_count)  # prints 4; note that "!" is counted as a separate token

In BCPL what does "of" do?

I am trying to understand some ancient code from a DEC PDP10 written in BCPL. A sample of the code is as follows:
test scanner()=S.DOTNAME then
$( word1:=checklook.up(scan.info,S.SFUNC,"unknown Special function [:s]")
D7 of temp:=P1 of word1
scanner()
$) or D7 of temp:=SF.ACTION
What do the "D7 of temp" and "P1 of word1" constructs do in this case?
The unstoppable Martin Richards is continuing to add features to the BCPL language(a), despite the fact that so few people are aware of it(b). Only seven or so questions are tagged bcpl on Stack Overflow but don't get me wrong: I liked this language and I have fond memories of using it back in the '80s.
Some of the things added since the last time I used it are the sub-field operators SLCT and OF. As per the manual on Martin's own site:
An expression of the form K OF E accesses a field of consecutive bits in memory. K must be a manifest constant equal to SLCT length:shift:offset and E must yield a pointer, p say.
The field is contained entirely in the word at position p + offset. It has a bit length of length and is shift bits from the right hand end of the word. A length of zero is interpreted as the longest length possible consistent with shift and the word length of the implementation.
Hence it's a more fine-grained way of accessing parts of memory than just the ! "dereference entire word" operator in that it allows you to get at specific bits within a word.
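For illustration, here is roughly what K OF E computes, sketched in Python rather than BCPL (the 36-bit word size is the PDP-10's; field_of and set_field are my own names). In the question's snippet, D7 and P1 are presumably such SLCT manifest constants, so D7 of temp := P1 of word1 reads the P1 field of word1 and writes it into the D7 field of temp:
WORD_BITS = 36   # PDP-10 word size

def field_of(memory, p, length, shift, offset):
    """Read the field selected by SLCT length:shift:offset from the word at p + offset."""
    if length == 0:                       # 0 means "as long as fits in the word"
        length = WORD_BITS - shift
    return (memory[p + offset] >> shift) & ((1 << length) - 1)

def set_field(memory, p, length, shift, offset, value):
    """Write value into that same field, leaving the other bits untouched."""
    if length == 0:
        length = WORD_BITS - shift
    mask = ((1 << length) - 1) << shift
    memory[p + offset] = (memory[p + offset] & ~mask) | ((value << shift) & mask)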
(a) Including, apparently, a version for the Raspberry PI, which may finally give me an excuse to break out all those spare PIs I have lying around, and educate the kids about the "good old days".
(b) It was used for at least one MC6809 embedded system I worked on, and formed a non-trivial part of AmigaDOS many moons ago.

How can I replace all non-words in a phrase, with the exception of numbers followed or preceded by characters?

Let us take a ruby array of sentences. Within the array we have
Sentences containing only words
Sentences containing phone numbers
Sentences containing numeric values with units of measurement
In this case we may have things that look like this: 1mL, 55mL, 1 mL, etc
Sentences containing quantities denoted as 1x or 5 x.
I am trying to construct a ruby regexp for the gsub or scan functions, such that I clean up the above sentences array to only be left with the words (1), units of measurement (3), and quantities (4) in each sentence, but clean up all non-word characters, such as phone numbers (2) and any other delimiting characters such as \t.
I've got this so far:
sentences.map do |sentence|
sentence.gsub!(/(?:(\d+)(?:[xX])|([xX])(?:\d+)[^a-zA-Z ])/, "")
end
Unfortunately, that replaces the exact opposite of what I want to replace. And, it doesn't account for cases where units of measurement are what I want to preserve at all.
Example inputs and outputs:
input: Lavender top (6 mL size preferred)
output: Lavender top (6 mL size preferred)
input: Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, 415-123-4567.
output: Blood & bone marrow aspirate: 15 mL centrifuge tube with transport media. Available from Cytogenetics, .
input: Gold top x1, Lt. Green top x 1, Lavender top x1
output: Gold top x1, Lt. Green top x 1, Lavender top x1
So, effectively, replace numbers and other non-alpha characters, but only when the numbers don't denote measurements or quantities.
I've been playing on rubular for about 3 hours to no avail. I think I might be misunderstanding look-aheads completely or just missing one key gotcha moment.
Looking forward to the regexp experts chiming in!
This could perhaps be a start:
input.map!{|x| x.gsub(/(?<!x\s|x)[\d-]+(?!\s?\w\w?)/i, '')}
#/(?<!x\s|x)[\d-]+(?!\s?\w\w)/i
# (?<!x\s|x) Don't match if preceded by an x or x+space
# [\d-]+ Match digits (and other junk)
# (?!\s?\w\w) Make sure it is not followed by a two-letter word. Here you could be more specific if it causes trouble.
# /expression/i makes the whole thing case-insensitive.
This works on your sample data, but there may be other cases not taken care of:
(?<!x\s?)\b[-.\d]+\b(?!\s*?ml)
The regex only matches the 415-123-4567 in your sample data.

Algorithm for matching cards to a set of rules

I've run into a peculiar problem which I don't seem to be able to wrap my head around. I'll get right into it.
The problem is matching a set of cards to a set of rules.
It is possible to define a set of rules as a string. It is composed of comma-separated tuples of <suit>:<value>. For example, H:4,S:1 should match a Four of Hearts and an Ace of Spades. It is also possible to use wildcards: for example, *:* matches any card, D:* matches any card in the diamond suit, and *:2 matches a Two in any suit. Rules can be combined with commas: *:*,*:*,H:4 would match a set of cards if it held two random cards and a Four of Hearts.
So far so good. A parser for this is easy and straight forward to write. Here comes the tricky part.
To make it easy to compose these rules, two more constructions can be used for suit and value. These are < (legal for suit and value) and +n (legal only for value) where n is a number. < means "the same as previous match" and +n means "n higher than previous match". An example:
*:*, <:*, *:<
Means: match any card, then match a card with the same suit as the first match, next match another card with the same value as the second match. This hand would match:
H:4,H:8,C:8
Because the Four of Hearts and the Eight of Hearts are the same suit, while the Eight of Hearts and the Eight of Clubs are the same value.
It is allowed to have more cards as long as all rules match (so, adding C:10 to the above hand would still match the rule).
My first approach at solving this was basically to take the set of cards to be matched and attempt to apply the first rule to it. If it matched, I moved on to the next rule and attempted to match it against the set of cards, and so on, until either all rules were matched or I found a rule that didn't match. This approach has (at least) one flaw; consider the example above: *:*,<:*,*:<, but with the cards in this order: H:8,C:8,H:4.
It would match H:8 for the first rule. Matched: H:8
Next it attempts to find a card with the same suit (Hearts). There is a Four of Hearts. Matched: H:8, H:4
Moving on, it wants to find a card with the same value (Four), and fails.
I don't want the way the set of cards is ordered to have any impact on the result as it does in the above example. I could sort the set of cards if I could think of any great strategy that worked well with any set of rules.
I have no knowledge of the quantity of cards or the number of rules, so a brute force approach is not feasible.
Thank you for reading this far, I am grateful for any tip or insight.
Your problem is actually an ordering problem. Here's a simple version of it:
given an input sequence of numbers and a pattern, reorder them so that they fit the pattern. The pattern can contain "*", meaning "any number", and ">", meaning "bigger than the previous number".
For example, given the pattern [* * > >] and the sequence [10 10 2 1], such an ordering exists and it is [10 1 2 10]. Some inputs might give no orderings, others exactly one, and still others many (think of the input [10 10 2 1] and the pattern [* * * *]).
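For what it's worth, here is a brute-force Python sketch of that simplified version (fine for small hands; larger inputs would need a backtracking or matching-based search). The first pattern symbol is assumed to be "*" since there is no previous number to compare against:
from itertools import permutations

def matches(order, pattern):
    for prev, cur, p in zip(order, order[1:], pattern[1:]):
        if p == ">" and not cur > prev:
            return False
    return True

def find_ordering(numbers, pattern):
    for order in permutations(numbers):
        if matches(order, pattern):
            return list(order)
    return None                 # no ordering fits the pattern

print(find_ordering([10, 10, 2, 1], ["*", "*", ">", ">"]))   # [10, 1, 2, 10]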
I'd say that once you have the solution for this simplified problem, switching to your problem is just a matter of adding another dimension and some operators. Sorry for not being of more help :/ .
Later edit: keep in mind that if the allowed character symbols are finite (i.e. 4) and the allowed numbers are too (i.e. 9), things might get easier.

How to neglect the output of OCR Engine that has no meaning?

The Tesseract OCR engine sometimes outputs text that has no meaning. I want to design an algorithm that rejects any text or word that has no meaning; below is the sort of output text that I want to reject. My simple solution is to count the words in the recognized text (separated by " ") and treat text with too many words as garbage (hint: I'm scanning images which will contain at most 40 words). Any idea will be helpful, thanks.
wo:>"|axnoA1wvw\
ldflfig
°J!9O‘ !P99W M9N 6 13!-|15!Cl ‘I-/Vl
978 89l9 Z0 3+ 3 'l9.l.
97 999 VLL lLOZ+ 3 9l!q°lN
wo0'|axno/(#|au1e>1e: new;
1=96r2a1ey\1 1uauud0|e/\e(]
|8UJB){ p8UJL|\7'
Divide the output text into words. Divide the words into triples. Count the triple frequencies, and compare them to the triple frequencies from a known-good text corpus (e.g. all the articles from some mailing list discussing what you intend to OCR, minus the header lines).
When I say "triples", I mean:
whe, hen, i, say, tri, rip, ipl, ple, les, i, mea, ean
...so "i" has a frequency of 2 in this short example, while the others are all frequency 1.
If you do a frequency count of each of these triples for a large document in your intended language, it should become possible to be reasonably accurate in guessing whether a string is in the same language.
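A rough Python sketch of that idea (the corpus, the 50% threshold and the handling of very short words are assumptions you would tune for your own data):
import re
from collections import Counter

def triples(text):
    out = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if len(word) < 3:
            out.append(word)                                  # short words count as-is
        else:
            out.extend(word[i:i+3] for i in range(len(word) - 2))
    return out

def build_model(known_good_text):
    return Counter(triples(known_good_text))

def looks_like_language(model, candidate, threshold=0.5):
    ts = triples(candidate)
    if not ts:
        return False
    seen = sum(1 for t in ts if t in model)
    return seen / len(ts) >= threshold                        # fraction of familiar triples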
Granted, it's heuristic.
I've used a similar approach for detecting English passwords in a password changing program. It worked pretty well, though there's no such thing as a perfect "obvious password rejecter".
Check the words against a dictionary?
Of course, this will have false positives for things like foreign phrases or code. The problem in general is intractable (e.g. is this code or gibberish? :) ). The only (nearly) perfect method would be to use this as a heuristic to flag certain sections for human review.
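A minimal sketch of that check in Python, assuming a plain word-list file is available (the /usr/share/dict/words path is just an example, and the 50% cut-off is arbitrary):
import re

with open("/usr/share/dict/words") as f:
    DICTIONARY = {line.strip().lower() for line in f}

def fraction_in_dictionary(text):
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in DICTIONARY for w in words) / len(words)

def probably_gibberish(text):
    return fraction_in_dictionary(text) < 0.5    # flag for human review rather than discarding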

Resources