Porter Stemmer, Step 1b - algorithm

Similar to this porter stemming algorithm implementation question, but expanded.
Basically, step1b is defined as:
Step 1b:
    (m>0) EED -> EE        feed      -> feed
                           agreed    -> agree
    (*v*) ED ->            plastered -> plaster
                           bled      -> bled
    (*v*) ING ->           motoring  -> motor
                           sing      -> sing
My question is why does feed stem to feed and not fe? All the online Porter stemmers I've tried stem it to feed, but from what I can see, it should stem to fe.
My train of thought is:
`feed` does not pass through `(m>0) EED -> EE`, since the measure of `feed` minus the suffix `eed` is `m(f) = 0`
`feed` will pass through `(*v*) ED ->`, as there is a vowel in the stem `fe` once the suffix `ed` is removed. So it will stem at this point to `fe`
Can someone explain to me how online Porter Stemmers manage to stem to feed?
Thanks.

It's because "feed" doesn't have a VC (vowel/consonant) combination, therefore m = 0. To remove the "ed" suffix, m > 0 (check the conditions for each step).

The rules for removing a suffix will be given in the form
(condition) S1 -> S2
This means that if a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2. The condition is usually given in terms of m, e.g.
(m > 1) EMENT ->
Here S1 is `EMENT' and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2.
Now, in your example:
(m>0) EED -> EE    feed -> feed
Before 'EED', is there a vowel followed by a consonant, repeated more than zero times?
The answer is no: before 'EED' there is only "f", so there is no vowel followed by a consonant.

In feed, m counts vowel-consonant ("VC") pairs in the stem, and there is no such pair, so m = 0 and the condition m > 0 fails.
But in agreed the stem contains the "VC" pair ag, so m > 0 and the condition holds. Hence it is replaced by agree.
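To make the measure concrete, here is a small Java sketch (not from any of the answers above; method names are my own) that counts VC pairs the way Porter's definition [C](VC)^m[V] describes, treating y as a vowel when it follows a consonant:

```java
public class PorterMeasure {
    // True if the character at position i is a consonant under Porter's rules:
    // a, e, i, o, u are vowels; 'y' counts as a vowel when preceded by a consonant.
    static boolean isConsonant(String s, int i) {
        char c = s.charAt(i);
        if ("aeiou".indexOf(c) >= 0) return false;
        if (c == 'y') return i == 0 || !isConsonant(s, i - 1);
        return true;
    }

    // m = number of vowel-then-consonant transitions in the stem.
    static int measure(String stem) {
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < stem.length(); i++) {
            boolean cons = isConsonant(stem, i);
            if (prevVowel && cons) m++;
            prevVowel = !cons;
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(measure("f"));      // stem of "feed" minus "eed" -> 0
        System.out.println(measure("agr"));    // stem of "agreed" minus "eed" -> 1
        System.out.println(measure("replac")); // -> 2, as in the REPLACEMENT example
    }
}
```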

It's really sad that nobody here actually read the question. This is why feed doesn't get stemmed to fe by rule 2 of step 1b:
The definition of the algorithm states:
In a set of rules written beneath each other, only one is obeyed, and this
will be the one with the longest matching S1 for the given word.
It isn't clearly stated that the conditions are always ignored here, but they are. So feed does match the first rule (though the rule isn't applied, since the condition isn't met), and therefore the rest of the rules in 1b are skipped.
The code would approximately look like this:
// 1b
if (word.ends_with("eed")) {          // (m > 0) EED -> EE
    mval = getMvalueOfStem();
    if (mval > 0) {
        word.erase("d");              // EED -> EE: drop only the final "d"
    }
}
else if (word.ends_with("ed")) {      // (*v*) ED -> NULL
    if (containsVowel(wordStem)) {
        word.erase("ed");
    }
}
else if (word.ends_with("ing")) {     // (*v*) ING -> NULL
    if (containsVowel(wordStem)) {
        word.erase("ing");
    }
}
The important things here are the else ifs.
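The same else-if structure can be sketched as runnable Java (helper names are invented for illustration; this only shows the rule selection of step 1b, not the follow-up adjustments such as AT -> ATE that the full algorithm performs after removing ED/ING):

```java
public class Step1b {
    // Porter's consonant test: a, e, i, o, u are vowels;
    // 'y' is a vowel when preceded by a consonant.
    static boolean isConsonant(String s, int i) {
        char c = s.charAt(i);
        if ("aeiou".indexOf(c) >= 0) return false;
        if (c == 'y') return i == 0 || !isConsonant(s, i - 1);
        return true;
    }

    // m = number of vowel-then-consonant transitions in the stem.
    static int measure(String stem) {
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < stem.length(); i++) {
            boolean cons = isConsonant(stem, i);
            if (prevVowel && cons) m++;
            prevVowel = !cons;
        }
        return m;
    }

    static boolean containsVowel(String stem) {
        for (int i = 0; i < stem.length(); i++)
            if (!isConsonant(stem, i)) return true;
        return false;
    }

    static String step1b(String word) {
        if (word.endsWith("eed")) {                   // (m > 0) EED -> EE
            String stem = word.substring(0, word.length() - 3);
            if (measure(stem) > 0) return stem + "ee";
            // condition failed: the longest matching suffix was chosen,
            // so the remaining rules are NOT tried
        } else if (word.endsWith("ed")) {             // (*v*) ED ->
            String stem = word.substring(0, word.length() - 2);
            if (containsVowel(stem)) return stem;
        } else if (word.endsWith("ing")) {            // (*v*) ING ->
            String stem = word.substring(0, word.length() - 3);
            if (containsVowel(stem)) return stem;
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(step1b("feed"));      // feed (EED matched, but m = 0)
        System.out.println(step1b("agreed"));    // agree
        System.out.println(step1b("plastered")); // plaster
        System.out.println(step1b("motoring"));  // motor
        System.out.println(step1b("sing"));      // sing
    }
}
```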


Verifying transformed string was actually changed (according to mapping table)

I have a mapping table, M:
And using this, I've performed a find & replace on string S which gives me the transformed string S':
S: {"z" "y" "g" "k"} -> S':{"z" "y" "h" "k"}
Now I wish to verify whether my mapping transformation was actually applied to S'. The pseudo-code I came up with for doing so is as follows:
I. Call function searchCol(x, “h”); // returns true if “h” can be found in column x in M.
II. If searchCol(x, “h”); returns true {
// assume mapping transformation was not applied to S'
// S'' after transforming S': {“z”, “y”, “i”, “j”}
}
III.If searchCol(x, “h”); returns false {
// assume mapping transformation was already applied to S'
// do nothing
}
IV. // log and continue …
However, as you can see, for the case above the algorithm doesn't work. Does anyone know a better way of going about this?
Cheers for your help.
Note: As my codebase is in Java, if you do provide any code examples, I'd prefer it if you posted them in the same language :)
Can you instead keep track of transformations? There are some cases where it's impossible to determine if a transformation took place, imagine this mapping table:
x -> y
y -> x
Now given the String yxyxyxyx, was it already transformed? And how many times?
But even if your mapping table is free of cycles, the only thing you can say is:
If the string contains a char that is on the left side and not on the right side,
then it was not yet transformed.
But if the above condition is not fulfilled, then you can not be sure of anything.
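Since you asked for Java: here is a sketch of that one safe test. The single-entry table g -> h is a hypothetical stand-in for your M (which isn't shown in the question), and Map.of assumes Java 9+:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MappingCheck {
    // Hypothetical mapping table standing in for M from the question.
    static final Map<Character, Character> M = Map.of('g', 'h');

    // Returns true only when we can be SURE the mapping was NOT yet applied:
    // the string still contains a char that appears on the left-hand side
    // of the table but never on the right-hand side. If it returns false,
    // nothing can be concluded either way.
    static boolean definitelyNotTransformed(String s) {
        Set<Character> leftOnly = new HashSet<>(M.keySet());
        leftOnly.removeAll(new HashSet<>(M.values()));
        for (char c : s.toCharArray())
            if (leftOnly.contains(c)) return true;
        return false;
    }

    public static void main(String[] args) {
        System.out.println(definitelyNotTransformed("zygk")); // true: 'g' occurs only on the left side
        System.out.println(definitelyNotTransformed("zyhk")); // false: inconclusive
    }
}
```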

Improving the Java 8 way of finding the most common words in "War and Peace"

I read this problem in Richard Bird's book: Find the top five most common words in War and Peace (or any other text for that matter).
Here's my current attempt:
public class WarAndPeace {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> wc =
            Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
                .map(line -> line.replaceAll("\\p{Punct}", ""))
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> word.matches("\\w+"))
                .map(s -> s.toLowerCase())
                .filter(s -> s.length() >= 2)
                .collect(Collectors.toConcurrentMap(
                    w -> w, w -> 1, Integer::sum));
        wc.entrySet()
            .stream()
            .sorted((e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()))
            .limit(5)
            .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
    }
}
This definitely looks interesting and runs reasonably fast. On my laptop it prints the following:
$> time java -server -Xmx10g -cp target/classes tmp.WarAndPeace
the: 34566
and: 22152
to: 16716
of: 14987
a: 10521
java -server -Xmx10g -cp target/classes tmp.WarAndPeace 1.86s user 0.13s system 274% cpu 0.724 total
It usually runs in under 2 seconds. Can you suggest further improvements to this from an expressiveness and a performance standpoint?
PS: If you are interested in the rich history of this problem, see here.
You're recompiling all the regexps on every line and on every word. Instead of .flatMap(line -> Arrays.stream(line.split("\\s+"))) write .flatMap(Pattern.compile("\\s+")::splitAsStream). The same for .filter(word -> word.matches("\\w+")): use .filter(Pattern.compile("^\\w+$").asPredicate()). The same for map.
Probably it's better to swap .map(s -> s.toLowerCase()) and .filter(s -> s.length() >= 2) in order not to call toLowerCase() for one-letter words.
You should not use Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum). First, your stream is not parallel, so you may easily replace toConcurrentMap with toMap. Second, it would probably be more efficient (though testing is necessary) to use Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)) as this would reduce boxing (but add a finisher step which will box all the values at once).
Instead of (e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()) you may use ready comparator: Map.Entry.comparingByValue() (though probably it's a matter of taste).
To summarize:
Map<String, Integer> wc =
    Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
        .map(Pattern.compile("\\p{Punct}")::matcher)
        .map(matcher -> matcher.replaceAll(""))
        .flatMap(Pattern.compile("\\s+")::splitAsStream)
        .filter(Pattern.compile("^\\w+$").asPredicate())
        .filter(s -> s.length() >= 2)
        .map(s -> s.toLowerCase())
        .collect(Collectors.groupingBy(w -> w,
            Collectors.summingInt(w -> 1)));
wc.entrySet()
    .stream()
    .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
    .limit(5)
    .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
If you don't like method references (some people don't), you may store precompiled regexps in the variables instead.
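For reference, the variable-based variant might look like this (same pipeline; the counting part is pulled into a method so it can be exercised without the file, and the file path is the one from the question):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCount {
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");
    private static final Pattern SPACES = Pattern.compile("\\s+");
    private static final Pattern WORD_ONLY = Pattern.compile("^\\w+$");

    // Same pipeline as above, with the patterns compiled once up front.
    static Map<String, Integer> wordCounts(Stream<String> lines) {
        return lines
            .map(line -> PUNCT.matcher(line).replaceAll(""))
            .flatMap(SPACES::splitAsStream)
            .filter(WORD_ONLY.asPredicate())
            .filter(s -> s.length() >= 2)
            .map(String::toLowerCase)
            .collect(Collectors.groupingBy(w -> w,
                     Collectors.summingInt(w -> 1)));
    }

    public static void main(String[] args) throws Exception {
        Map<String, Integer> wc =
            wordCounts(Files.lines(Paths.get("/tmp", "/war-and-peace.txt")));
        wc.entrySet().stream()
          .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
          .limit(5)
          .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
    }
}
```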
You are performing several redundant and unnecessary operations.
You first replace all punctuation characters with empty strings, creating new strings, then you perform a split operation using space characters as the boundary. This even risks merging words that are separated by punctuation without spacing. You could fix that by replacing punctuation with spaces, but in the end you don't need that replacement at all, as you can change the split pattern to "punctuation or space".
You are then filtering the split results by accepting strings solely consisting of word characters only. Since you have already removed all punctuation and spacing characters, this will sort out strings having characters that are neither word, space or punctuation characters and I'm not sure if this is the intended logic. After all, if you are interested in words only, why not search for words only in the first place? Since Java 8 does not support streams of matches, we can direct it to split using non-word characters as boundary.
Then you are doing a .map(s -> s.toLowerCase()).filter(s -> s.length() >= 2). Since for English texts, the string length won’t change when changing it to uppercase, the filtering condition is not affected, so we can filter first, skipping the toLowerCase conversion for strings that are not accepted by the predicate: .filter(s -> s.length() >= 2).map(s -> s.toLowerCase()). The net benefit might be small, but it doesn’t hurt.
Choosing the right Collector. Tagir already explained it. In principle, there’s Collectors.counting() which fits better than Collectors.summingInt(w->1), but unfortunately, Oracle’s current implementation is poor as it is based on reduce, unboxing and reboxing Longs for all elements.
Putting it all together, you’ll get:
Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
    .flatMap(Pattern.compile("\\W+")::splitAsStream)
    .filter(s -> s.length() >= 2)
    .map(String::toLowerCase)
    .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)))
    .entrySet()
    .stream()
    .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
    .limit(5)
    .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
As explained, don’t be surprised if the word counts are slightly higher than in your approach.

opennlp chunker and postag results

Java - opennlp
I am new to OpenNLP, and I am trying to analyze a sentence. I have the POS tag and chunk results, but I could not understand what the values mean. Is there a table that explains the full-form meaning of the POS tag and chunk result values?
Tokens: [My, name, is, Chris, corrale, and, I, live, in, Philadelphia, USA, .]
POS Tags: [PRP$, NN, VBZ, NNP, NN, CC, PRP, VBP, IN, NNP, NNP, .]
chunk Result: [B-NP, I-NP, B-VP, B-NP, I-NP, O, B-NP, B-VP, B-PP, B-NP, I-NP, O]
The POS tags are from the Penn Treebank tagset. The chunks are noun phrases (NP), verb phrases (VP), and prepositions (PP). "B-.." marks the beginning of such a phrase, "I-.." means something like "inner", i.e. the phrase continues here (see OpenNLP docs).
S -> Simple declarative clause, i.e. one that is not introduced by a (possibly empty) subordinating conjunction or a wh-word and that does not exhibit subject-verb inversion.
SBAR -> Clause introduced by a (possibly empty) subordinating conjunction.
SBARQ -> Direct question introduced by a wh-word or a wh-phrase. Indirect questions and relative clauses should be bracketed as SBAR, not SBARQ.
SINV -> Inverted declarative sentence, i.e. one in which the subject follows the tensed verb or modal.
SQ -> Inverted yes/no question, or main clause of a wh-question, following the wh-phrase in SBARQ.
ADJP -> Adjective Phrase.
ADVP -> Adverb Phrase.
CONJP -> Conjunction Phrase.
FRAG -> Fragment.
INTJ -> Interjection. Corresponds approximately to the part-of-speech tag UH.
LST -> List marker. Includes surrounding punctuation.
NAC -> Not a Constituent; used to show the scope of certain prenominal modifiers within an NP.
NP -> Noun Phrase.
NX -> Used within certain complex NPs to mark the head of the NP. Corresponds very roughly to N-bar.
PP -> Prepositional Phrase.
PRN -> Parenthetical.
PRT -> Particle. Category for words that should be tagged RP.
QP -> Quantifier Phrase (i.e. complex measure/amount phrase); used within NP.
RRC -> Reduced Relative Clause.
UCP -> Unlike Coordinated Phrase.
VP -> Verb Phrase.
WHADJP -> Wh-adjective Phrase. Adjectival phrase containing a wh-adverb, as in how hot.
WHADVP -> Wh-adverb Phrase. Introduces a clause with an NP gap. May be null (containing the 0 complementizer) or lexical, containing a wh-adverb such as how or why.
WHNP -> Wh-noun Phrase. Introduces a clause with an NP gap. May be null (containing the 0 complementizer) or lexical, containing some wh-word, e.g. who, which book, whose daughter, none of which, or how many leopards.
WHPP -> Wh-prepositional Phrase. Prepositional phrase containing a wh-noun phrase (such as of which or by whose authority) that either introduces a PP gap or is contained by a WHNP.
X -> Unknown, uncertain, or unbracketable. X is often used for bracketing typos and in bracketing the...the-constructions.
Credit: http://mail-archives.apache.org/mod_mbox/opennlp-users/201402.mbox/%3CCACQuOSXOeyw2O-AZtW3m=iABo1=3cpZOdPiWFXoNwN-SVWo4gQ#mail.gmail.com%3E
Please refer to the POS tag list for the details of each tag.
Chunk tags like "B-NP" are made up of two or three parts:
First part:
B - marks the beginning of a chunk
I - marks the continuation of a chunk
E - marks the end of a chunk
A chunk may be only one word long; such a word is both the beginning and the end of a chunk at the same time.
Second part:
NP - noun chunk
VP - verb chunk
For more details, refer to the OpenNLP documentation.

Production rules for a grammar

Before anything, yes, this is from coursework and I've been at it sporadically while dealing with another project.
A language consists of those strings (of terminals 'a' and 'b') where the number of a's equals the number of b's. I am trying to find the production rules of a grammar that defines the above language.
More formally, L(G) = {w | Na(w) = Nb(w)}
So I guess it should go something like L = {ϵ, ab, aabb, abab, abba, bbaa, ... and so on }
Any hints, or even related problems with solution would do which might help me better grasp the present problem.
I think this is it:
S -> empty (1)
S -> aSb (2)
S -> bSa (3)
S -> SS (4)
Edit: I changed the rules. Now here's how to produce bbaaabab:
S ->(4) SS ->(4) SSS ->(3) bSaSS ->(3) bbSaaSS ->(1) bbaaSS
->(2) bbaaaSbS ->(2) bbaaaSbaSb ->(1) bbaaabaSb ->(1) bbaaabab
Another hint: Write all your production rules such that they guarantee Na(w) = Nb(w) at every step.
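That invariant can be spot-checked mechanically. The Java sketch below (my own, for illustration) derives random words from the four rules above, with a depth cap so derivations terminate, and verifies Na(w) = Nb(w) for each. Note this only checks soundness (everything the grammar produces is in L), not completeness:

```java
import java.util.Random;

public class EqualAB {
    static final Random rnd = new Random(42);

    // Expand the nonterminal S by picking one of the four rules at random:
    // (1) S -> "", (2) S -> aSb, (3) S -> bSa, (4) S -> SS.
    // 'depth' caps the recursion so the derivation always terminates.
    static String derive(int depth) {
        if (depth == 0) return "";
        switch (rnd.nextInt(4)) {
            case 0:  return "";
            case 1:  return "a" + derive(depth - 1) + "b";
            case 2:  return "b" + derive(depth - 1) + "a";
            default: return derive(depth - 1) + derive(depth - 1);
        }
    }

    static long count(String w, char c) {
        return w.chars().filter(ch -> ch == c).count();
    }

    public static void main(String[] args) {
        // Every word the grammar produces satisfies Na(w) = Nb(w).
        for (int i = 0; i < 1000; i++) {
            String w = derive(6);
            if (count(w, 'a') != count(w, 'b'))
                throw new AssertionError("counterexample: " + w);
        }
        System.out.println("all derived words satisfy Na = Nb");
    }
}
```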

How does pattern matching work behind the scenes in F#?

I am completely new to F# (and functional programming in general) but I see pattern matching used everywhere in sample code. I am wondering for example how pattern matching actually works? For example, I imagine it working the same as a for loop in other languages and checking for matches on each item in a collection. This is probably far from correct, how does it actually work behind the scenes?
How does pattern matching actually work? The same as a for loop?
It is about as far from a for loop as you could imagine: instead of looping, a pattern match is compiled to an efficient automaton. There are two styles of automaton, which I call the "decision tree" and the "French style." Each style offers different advantages: the decision tree inspects the minimum number of values needed to make a decision, but a naive implementation may require exponential code space in the worst case. The French style offers a different time-space tradeoff, with good but not optimal guarantees for both.
But the absolutely definitive work on this problem is Luc Maranget's excellent paper "Compiling Pattern Matching to Good Decision Trees" from the 2008 ML Workshop. Luc's paper basically shows how to get the best of both worlds. If you want a treatment that may be slightly more accessible to the amateur, I humbly recommend my own offering When Do Match-Compilation Heuristics Matter?
Writing a pattern-match compiler is easy and fun!
It depends on what kind of pattern matching you mean - it is quite a powerful construct and can be used in all sorts of ways. However, I'll try to explain how pattern matching works on lists. You can write for example these patterns:
match l with
| [1; 2; 3] -> // specific list of 3 elements
| 1::rest -> // list starting with 1 followed by more elements
| x::xs -> // non-empty list with element 'x' followed by a list
| [] -> // empty list (no elements)
The F# list is actually a discriminated union containing two cases - [] representing an empty list or x::xs representing a list with first element x followed by some other elements. In C#, this might be represented like this:
// Represents any list
abstract class List<T> { }
// Case '[]' representing an empty list
class EmptyList<T> : List<T> { }
// Case 'x::xs' representing list with element followed by other list
class ConsList<T> : List<T> {
    public T Value { get; set; }
    public List<T> Rest { get; set; }
}
The patterns above would be compiled to the following (I'm using pseudo-code to make this simpler):
if (l is ConsList) && (l.Value == 1) &&
   (l.Rest is ConsList) && (l.Rest.Value == 2) &&
   (l.Rest.Rest is ConsList) && (l.Rest.Rest.Value == 3) &&
   (l.Rest.Rest.Rest is EmptyList) then
    // specific list of 3 elements
else if (l is ConsList) && (l.Value == 1) then
    var rest = l.Rest;
    // list starting with 1 followed by more elements
else if (l is ConsList) then
    var x = l.Value, xs = l.Rest;
    // non-empty list with element 'x' followed by a list
else if (l is EmptyList) then
    // empty list (no elements)
As you can see, there is no looping involved. When processing lists in F#, you would use recursion to implement looping, but pattern matching is used on individual elements (ConsList) that together compose the entire list.
Pattern matching on lists is a specific case of discriminated union which is discussed by sepp2k. There are other constructs that may appear in pattern matching, but essentially all of them are compiled using some (complicated) if statement.
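The same compilation idea can be sketched in runnable Java (class names invented for illustration; requires Java 16+ for instanceof patterns): the cons-list becomes a small class hierarchy, and the "compiled" match is just an ordered chain of instanceof tests, with no looping:

```java
public class MatchDemo {
    // A minimal cons-list, mirroring the C# sketch above.
    static abstract class Lst {}
    static final class Empty extends Lst {}
    static final class Cons extends Lst {
        final int value; final Lst rest;
        Cons(int value, Lst rest) { this.value = value; this.rest = rest; }
    }

    // Hand-"compiled" version of the F# match: ordered instanceof tests.
    // The first pattern that matches wins; no iteration over the list as a whole.
    static String match(Lst l) {
        if (l instanceof Cons c1 && c1.value == 1
                && c1.rest instanceof Cons c2 && c2.value == 2
                && c2.rest instanceof Cons c3 && c3.value == 3
                && c3.rest instanceof Empty)
            return "specific list [1; 2; 3]";
        else if (l instanceof Cons c && c.value == 1)
            return "starts with 1";
        else if (l instanceof Cons c)
            return "non-empty, head = " + c.value;
        else
            return "empty list";
    }

    public static void main(String[] args) {
        Lst l = new Cons(1, new Cons(2, new Cons(3, new Empty())));
        System.out.println(match(l));           // specific list [1; 2; 3]
        System.out.println(match(new Empty())); // empty list
    }
}
```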
No, it doesn't loop. If you have a pattern match like this
match x with
| Foo(a, b) -> a + b
| Bar c -> c
this compiles down to something like this pseudo code:
if (x is a Foo)
    let a = (first element of x) in
    let b = (second element of x) in
    a + b
else if (x is a Bar)
    let c = (first element of x) in
    c
If Foo and Bar are constructors from an algebraic data type (i.e. a type defined like type FooBar = Foo of int * int | Bar of int), the operations x is a Foo and x is a Bar are simple comparisons. If they are defined by an active pattern, the operations are defined by that pattern.
If you compile your F# code to an .exe and then take a look with Reflector, you can see the C# "equivalent" of the F# code.
I've used this method to look at F# examples quite a bit.
