One node may have more than two children in the parse tree produced by the Stanford Parser (for example, with the englishPCFG.ser.gz model).
How can I obtain a binarized parse tree with POS-tagging information on each node?
Are there any parameters I can pass to the parser to achieve this?
The trees are not strictly binary-branching because the Penn treebank on which the parser was trained isn't. This is a theoretical problem with the (now ancient) treebank that continues to bedevil computational linguists!
The way in which I've dealt with this is by writing complex tree-transformation logic that restructures the output of the constituency parser into binary-branching structures, using X-bar-theoretic representations, in the process promoting functional projections over lexical phrases, raising quantifiers, and so on.
[Update] I tried the TreeBinarizer class. It worked well on the one example I used. I'm parsing Spanish, and using Clojure. Here's a sample session:
user=> (import edu.stanford.nlp.parser.lexparser.TreeBinarizer)
edu.stanford.nlp.parser.lexparser.TreeBinarizer
user=> (import edu.stanford.nlp.trees.international.spanish.SpanishTreebankLanguagePack)
edu.stanford.nlp.trees.international.spanish.SpanishTreebankLanguagePack
user=> (import edu.stanford.nlp.trees.international.spanish.SpanishHeadFinder)
edu.stanford.nlp.trees.international.spanish.SpanishHeadFinder
user=> ; I have a parsed tree:
user=> (.pennPrint t)
(sp
(prep (sp000 a))
(S
(infinitiu (vmn0000 decir))
(S
(conj (cs que))
(grup.verb (vaip000 hemos) (vmp0000 visto))
(sn
(spec (di0000 un))
(grup.nom (nc0s000 relámpago))))))
nil
user=> ; let's create a binarizer
user=> (def tb (TreeBinarizer/simpleTreeBinarizer (SpanishHeadFinder.) (SpanishTreebankLanguagePack.)))
#'user/tb
user=> ; now transform the tree above -- note that the second embedded S node has three children
user=> (.pennPrint (.transformTree tb t))
(sp
(prep (sp000 a))
(S
(infinitiu (vmn0000 decir))
(S
(conj (cs que))
(#S
(grup.verb (vaip000 hemos) (vmp0000 visto))
(sn
(spec (di0000 un))
(grup.nom (nc0s000 relámpago)))))))
nil
user=> ; the binarizer created an intermediate phrasal node #S, pushing the conjunction into <Spec, #S>
My code is spending most of its time scoring bisections: determining how many edges of a graph cross from one set of nodes to the other.
Assume bisect is a set of half of a graph's nodes (ints), and edges is a list of (directed) edges [ [n1 n2] ...] where n1,n2 are also nodes.
(defn tstBisectScore
  "number of edges crossing bisect"
  ([bisect edges]
   (tstBisectScore bisect 0 edges))
  ([bisect nx edge2check]
   (if (empty? edge2check)
     nx
     (let [[n1 n2] (first edge2check)
           inb1 (contains? bisect n1)
           inb2 (contains? bisect n2)]
       (if (or (and inb1 inb2)
               (and (not inb1) (not inb2)))
         (recur bisect nx (rest edge2check))
         (recur bisect (inc nx) (rest edge2check)))))))
The only clues I have from sampling the execution of this code (using VisualVM) show most of the time spent in clojure.core$empty_QMARK_, and most of the rest in clojure.core$contains_QMARK_. (first and rest take only a small fraction of the time.)
Any suggestions as to how I could tighten the code?
First I would say that you haven't expanded that profile deep enough. empty? is not an expensive function in general. The reason it is taking up all your time is almost surely because the input to your function is a lazy sequence, and empty? is the poor sap whose job it is to look at its elements first. So all the time in empty? is probably actually time you should be accounting to whatever generates the input sequence. You could confirm this by profiling (tstBisectScore bisect (doall edges)) and comparing to your existing profile of (tstBisectScore bisect edges).
Assuming that my hypothesis is true, almost 80% of your real workload is probably in generating the bisects, not in scoring them. So anything we do in this function can get us at most a 20% speedup, even if we replaced the whole thing with (map (constantly 0) edges).
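One rough way to check this hypothesis without a full profile (generate-edges below is a hypothetical stand-in for whatever lazily builds your edge list):
;; generate-edges is hypothetical; realizing the seq before timing isolates the scoring cost:
(let [es (doall (generate-edges))]
  (time (tstBisectScore bisect es)))
;; Timing the lazy pipeline end to end includes generation as well:
(time (tstBisectScore bisect (generate-edges)))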
Still, there are many local improvements to be made. Let's imagine we've determined that producing the input argument is as efficient as we can get it, and we need more speed.
When iterating eagerly over something, use next instead of rest. The point of rest is that it's a bit lazier, and always returns a non-nil sequence instead of peeking to see if there is a next element. If you know you will need the next element anyway, use next to get both bits of information at once.
In general, empty? is not an efficient way to test a sequence. (defn empty? [x] (not (seq x))) is obviously a wasted not. If you care about efficiency, write (seq x) instead, and invert your if branches. Better still, if you know x is the result of a next call, it can never be an empty sequence: only nil, or a non-empty sequence. So just write (if x ...).
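To see the difference at the REPL:
(rest [1])   ;; => ()    rest never returns nil, even when nothing remains
(next [1])   ;; => nil   next returns nil when nothing remains
(seq [])     ;; => nil   seq is the idiomatic emptiness test
(seq [1 2])  ;; => (1 2)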
(or (and inb1 inb2)
(and (not inb1) (not inb2)))
is a very expensive way to write (= inb1 inb2).
So for starters, you could instead write
(defn tstBisectScore
  ([bisect edges] (tstBisectScore bisect 0 (seq edges)))
  ([bisect nx edges]
   (if edges
     (recur bisect
            (let [[n1 n2] (first edges)
                  inb1 (contains? bisect n1)
                  inb2 (contains? bisect n2)]
              (if (= inb1 inb2) nx (inc nx)))
            (next edges))
     nx)))
Note that I've also rearranged things a bit, by putting the if and let inside of the recur instead of duplicating the other arguments to the recur. This isn't a very popular style, and it doesn't matter for efficiency. Here it serves a pedagogical purpose: to draw your attention to the basic structure of this function that you missed. Your whole function has the structure (if xs (recur (f acc x) (next xs)) acc). This is exactly what reduce already does!
I could write out the translation to use reduce, but first I'll point out that you also have a map step hidden in there, mapping some elements to 1 and others to 0, after which your reduce phase just sums the list. So, instead of using lazy sequences to do that, we'll use a transducer, and avoid allocating the intermediate sequences:
(defn tstBisectScore [bisect edges]
  (transduce (map (fn [[n1 n2]]
                    (if (= (contains? bisect n1)
                           (contains? bisect n2))
                      0 1)))
             + 0 edges))
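A quick sanity check with hypothetical sample data:
(tstBisectScore #{1 2 3} [[1 4] [2 3] [5 6]])
;; => 1   ; only the edge [1 4] crosses the bisection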
This is a lot less code because you let existing abstractions do the work for you, and it should be more efficient because (a) these abstractions don't make the local mistakes you made, and (b) they also handle chunked sequences more efficiently, which is a sizeable boost that comes up surprisingly often when using basic tools like map, range, and filter.
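For reference, the reduce translation alluded to above might look like this (a sketch of the same logic):
(defn tstBisectScore [bisect edges]
  (reduce (fn [nx [n1 n2]]
            ;; count the edge only when exactly one endpoint is in the bisection
            (if (= (contains? bisect n1) (contains? bisect n2))
              nx
              (inc nx)))
          0 edges))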
This answer is based on the answer from amalloy above and shows some additional ways to speed up this code:
Use Java arrays:
Convert edges with (into-array (map into-array edges)). This allows you to use operations like aget, aset and especially areduce.
Use Java methods:
In the following code, I replaced = with .equals and contains? with .contains.
Use type hints:
Using these tips, I rewrote your function like this:
;; HashSet and Collection live in java.util:
(import '(java.util HashSet Collection))

(defn tst-bisect-score [^HashSet bisect
                        ^"[[Ljava.lang.Long;" edges]
  (areduce edges i ret (long 0)
           (+ ret
              (let [^"[Ljava.lang.Long;" e (aget edges i)]
                (if (.equals ^Boolean (.contains bisect ^Long (aget e 0))
                             ^Boolean (.contains bisect ^Long (aget e 1)))
                  0 1)))))
Convert your arguments in advance with (HashSet. ^Collection bisect) and (into-array (map into-array edges)) and then call:
(tst-bisect-score bisect edges)
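Concretely, the conversion and call might look like this (hypothetical sample data, reusing the java.util imports above):
(def bisect-set (HashSet. ^Collection #{1 2 3}))
(def edge-array (into-array (map into-array [[1 4] [2 3] [5 6]])))

(tst-bisect-score bisect-set edge-array)
;; => 1   ; only the edge [1 4] crosses the bisection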
Trying to figure out how to use "append" in Scheme.
Here is what I could find about the concept of append:
----- Part 1: understanding the concept of append in Scheme -----
1) append takes two or more lists and constructs a new list with all of their elements.
2) append requires that its arguments are lists, and makes a list whose elements are the elements of those lists. It concatenates the lists it is given. (It effectively conses the elements of the other lists onto the last list to create the result list.)
3) It only concatenates the top-level structure ==> [Q1] What does "only concatenates the top-level" mean?
4) However, it doesn't "flatten" nested structures.
==> [Q2] What is "flatten"? (I have seen "flatten" in many places but haven't figured it out yet)
==> [Q3] Why doesn't append "flatten" nested structures?
---------- Part 2: how to use append in Scheme ----------
Then I looked around for how to use "append" and found another discussion.
Based on that discussion, I tried this implementation:
[code 1]
(define (tst-foldr-append lst)
  (foldr
   (lambda (element acc) (append acc (list element)))
   '()
   lst))
It works, but I am struggling to understand this part: (append acc (list element)).
What exactly is "append" doing in code 1? To me, it looks like it just flips (reverses) the list.
Then why can't other logic be used instead, e.g.:
i) simply just flipping, or
ii) (cons acc element)
[Q4] Why does it have to be "append" in code 1? Is that because of something to do with foldr?
Again, sorry for the long question, but I think it is all related.
Q1/2/3: What is this "flattening" thing?
Scheme/Lisp/Racket make it very very easy to use lists. Lists are easy to construct and easy to operate on. As a result, they are often nested. So, for instance
`(a b 34)
denotes a list of three elements: two symbols and a number. However,
`(a (b c) 34)
denotes a list of three elements: a symbol, a list, and a number.
The word "flatten" is used to refer to the operation that turns
`(3 ((b) c) (d (e f)))
into
`(3 b c d e f)
That is, the lists-within-lists are "flattened".
The 'append' function does not flatten lists; it just combines them. So, for instance,
(append `(3 (b c) d) `(a (9)))
would produce
`(3 (b c) d a (9))
Another way of saying it is this: if you apply 'append' to a list of length 3 and a list of length 2, the result will be of length 5.
Q4/5: foldr really has nothing to do with append. I think I would ask a separate question about foldr if I were you.
Final advice: go check out htdp.org.
Q1: It means that sublists are not recursively appended, only the top-most elements are concatenated, for example:
(append '((1) (2)) '((3) (4)))
=> '((1) (2) (3) (4))
Q2: Related to the previous question, flattening a list gets rid of the sublists:
(flatten '((1) (2) (3) (4)))
=> '(1 2 3 4)
Q3: By design: append only concatenates lists; for flattening nested structures, use flatten.
Q4: Please read the documentation before asking these kinds of questions. append is simply a different procedure, not necessarily related to foldr, but they can be used together; it concatenates lists (if the last argument is itself a list, the result will be a proper list). cons just sticks two things together, no matter their type, whereas append always returns a list (proper or improper) as output. For example, to append one element at the end you can do this:
(append '(1 2) '(3))
=> '(1 2 3)
But these expressions will give different results (tested in Racket):
(append '(1 2) 3)
=> '(1 2 . 3)
(cons '(1 2) '(3))
=> '((1 2) 3)
(cons '(1 2) 3)
=> '((1 2) . 3)
Q5: No, cons will work fine here. You wouldn't be asking any of this if you simply tested each procedure to see how it works. Please understand what you're using by reading the documentation and writing little examples; it's the only way you'll ever learn how to program.
I'm a rank beginner with Typed Racket, and I was playing around with the very simple Tree type defined in the Beginner's Guide:
#lang typed/racket
(define-type Tree (U leaf node))
(struct: leaf ([val : Number]))
(struct: node ([left : Tree] [right : Tree]))
As an exercise, I decided to write a higher-order function to descend the tree:
(: tree-descend : All (A) (Number -> A) (A A -> A) Tree -> A)
(define (tree-descend do-leaf do-node tree)
(if (leaf? tree)
(do-leaf (leaf-val tree))
(do-node (tree-descend do-leaf do-node (node-left tree))
(tree-descend do-leaf do-node (node-right tree)))))
This type-checks fine. However, when I try to use it to redefine the tree-sum function that sums over all the leaves, I get a surprising and verbose error message:
(: tree-sum : Tree -> Number)
(define (tree-sum t)
(tree-descend identity + t))
The error messages are
Type Checker: Polymorphic function `tree-descend' could not be applied to arguments:
Argument 1:
Expected: (Number -> A)
Given: (All (a) (a -> a))
Argument 2:
Expected: (A A -> A)
Given: (case-> (-> Zero) (Zero Zero -> Zero) (One Zero -> One)
(Zero One -> One) (Positive-Byte Zero -> Positive-Byte)
[...lots of ways of combining subtypes of Number...]
(Index Positive-Index Index -> Positive-Fixnum)
(Index Index Positive-Index -> ...)
in: (tree-descend identity + t)
Now, to my untrained eye, it looks like this should work just fine, because clearly the polymorphic type A should just be Number and then everything works. Obviously the language disagrees with me for some reason, but I'm not sure what that reason is.
Unfortunately, Typed Racket can't infer the proper instantiation of polymorphic functions like tree-descend when you apply them to polymorphic arguments like identity. If you replace identity with (inst identity Number), i.e. (tree-descend (inst identity Number) + t), then the program works fine.
In Structure and Interpretation of Computer Programs (SICP) Section 2.2.3 several functions are defined using:
(accumulate cons nil
(filter pred
(map op sequence)))
Two examples that make use of this operate on a list of the Fibonacci numbers: even-fibs and list-fib-squares.
The accumulate, filter and map functions are defined in section 2.2 as well. The part that's confusing me is why the authors included the accumulate here. accumulate takes 3 parameters:
A binary function to be applied
An initial value, used as the rightmost parameter to the function
A list to which the function will be applied
An example of applying accumulate to a list using the definition in the book:
(accumulate cons nil (list 1 2 3))
=> (cons 1 (cons 2 (cons 3 nil)))
=> (1 2 3)
Since the third parameter is a list, (accumulate cons nil some-list) will just return some-list, and in this case the result of (filter pred (map op sequence)) is a list.
Is there a reason for this use of accumulate other than consistency with other similarly structured functions in the section?
I'm certain that those two uses of accumulate are merely illustrative of the fact that "consing elements to construct a list" can be treated as an accumulative process in the same way that "multiplying numbers to obtain a product" or "summing numbers to obtain a total" can. You're correct that the accumulation is effectively a no-op.
(Aside: Note that this could obviously be a more useful operation if the output of filter and input of accumulate was not a list; for example, if it represented a lazily generated sequence.)
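For comparison, here is a rough Clojure transcription of the book's accumulate (a right fold), following the expansion shown above, just to make the no-op concrete:
(defn accumulate [op initial s]
  (if (empty? s)
    initial
    (op (first s) (accumulate op initial (rest s)))))

;; Accumulating with cons onto an empty list merely rebuilds the input:
(accumulate cons '() (list 1 2 3))
;; => (1 2 3)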
For clojure's sorted-map, how do I find the entry having the key closest to a given value?
e.g. Suppose I have
(def my-map (sorted-map
             1 'A
             2 'B
             5 'C))
I would like a function like
(find-closest my-map 4)
which would return the entry [5 C], since that's the entry with the closest key. I could do a linear search, but since the map is sorted, there should be a way of finding this value in something like O(log n).
I can't find anything in the API which makes this possible. If, for instance, I could ask for the i'th entry in the map, I could cobble together a function like the one I want, but I can't find any such function.
Edit:
So apparently sorted-map is based on the PersistentTreeMap class, implemented in Java as a red-black tree. So this really seems like it should be doable, at least in principle.
subseq and rsubseq are very fast because they exploit the tree structure:
(def m (sorted-map 1 :a, 2 :b, 5 :c))
(defn abs [x] (if (neg? x) (- x) x))
(defn find-closest [sm k]
  (if-let [below (first (rsubseq sm <= k))]       ; nearest entry at or below k
    (let [a (key below)]
      (if (= a k)
        a
        (if-let [above (first (subseq sm >= k))]  ; nearest entry at or above k
          (let [b (key above)]
            (if (< (abs (- k b)) (abs (- k a))) b a))
          a)))                                    ; nothing above: the key below wins
    (key (first (subseq sm >= k)))))              ; nothing below: take the key above
user=> (find-closest m 4)
5
user=> (find-closest m 3)
2
This does slightly more work than ideal: in the ideal scenario we would just do a <= search, then look at the node to the right to check whether there is anything closer in that direction. You can access the tree with (.tree m), but the .left and .right methods aren't public, so custom traversal is not currently possible.
Use the Clojure contrib library data.avl. It supports sorted-maps with a nearest function and other useful features.
https://github.com/clojure/data.avl
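A short sketch of how that looks (assuming the data.avl dependency is available, and if I recall the API correctly, nearest takes a comparison test and a key and returns the nearest matching map entry):
(require '[clojure.data.avl :as avl])

(def m (avl/sorted-map 1 :a, 2 :b, 5 :c))

(avl/nearest m <= 4)  ;; => [2 :b], nearest entry at or below 4
(avl/nearest m >= 4)  ;; => [5 :c], nearest entry at or above 4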
The first thing that comes to my mind is to pull the map's keys out into a vector and then to do a binary search in that. If there is no exact match to your key, the two pointers involved in a binary search will end up pointing to the two elements on either side of it, and you can then choose the closer one in a single (possibly tie breaking) operation.
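A minimal sketch of that idea (find-closest-key is a hypothetical helper; it assumes a non-empty map with numeric keys, and ties go to the lower key):
(defn find-closest-key [sm k]
  ;; keys of a sorted map come out in ascending order
  (let [ks (vec (keys sm))]
    (loop [lo 0, hi (dec (count ks))]
      (if (<= (- hi lo) 1)
        ;; (ks lo) and (ks hi) now bracket k; pick whichever is closer
        (let [a (ks lo), b (ks hi)]
          (if (< (Math/abs (- k b)) (Math/abs (- k a))) b a))
        (let [mid (quot (+ lo hi) 2)]
          (if (< (ks mid) k)
            (recur mid hi)
            (recur lo mid)))))))

(find-closest-key (sorted-map 1 :a, 2 :b, 5 :c) 4)
;; => 5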