What does NML tag convey? - stanford-nlp

Basic query:
Stanford parser version 4.0.0 uses an NML tag. I think it is a useful feature, but I do not fully understand it. I would appreciate more information about it, e.g. its full form and the motivation for introducing it. Why does the parser treat "Income tax proposal" and "Fish tank water" differently? Has the parser learnt the use of the NML tag correctly?
The following is optional; please read it if you think that I am making up a fictitious tag!
The following information is just to establish that this is a serious enquiry. My previous query about the NML tag was rejected because my guess at the meaning of the tag misled me and I somehow gave a wrong example. I am sorry for that.
Please see:
https://nlp.stanford.edu/nlp/javadoc/javanlp/index.html?edu/stanford/nlp/trees/ModCollinsHeadFinder.html
Under the heading Changes:
QUOTE
Added NML head rules, which are the same as for NP.
NP head rule: NP and NML are treated almost identically (NP has precedence)
NAC head rule: NML comes after NN/NNS but after NNP/NNPS
UNQUOTE
I am getting NML tags in several sentences while running the Stanford parser version 4.0.0
Here is just one example:
Parsing [sent. 1 len. 7]: The income tax proposal was rejected .
(ROOT
  (S
    (NP (DT The)
      (NML (NN income) (NN tax))
      (NN proposal))
    (VP (VBD was)
      (VP (VBN rejected)))
    (. .)))

The NML label should be for a noun phrase that is modifying another word or phrase. So a good example would be "income tax proposal": "income tax" is an NML since it is serving as an adjective of "proposal". It is describing the type of proposal.
Syntactically, "income tax proposal" and "marriage proposal" share the same high-level structure, a noun phrase describing another noun, so the point of NML is to note that the phrase "income tax" is a complete unit and that it is modifying the word "proposal" to generate the final NP "income tax proposal".
If the actual statistical parser is inconsistent, as in the case of fish tank water, that is more likely an error in the model itself, which is just something you have to accept. Statistical parsers make lots of errors all the time.
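To check how the parser brackets a given phrase, it can help to pull the NML spans out of the printed tree. The following is a minimal sketch in plain Python, not part of the Stanford parser; the helper names parse_tree, leaves and phrases are made up for illustration, and the input is the parse shown in the question.

```python
import re

def parse_tree(text):
    """Parse a Penn-style bracketed string into (label, children) tuples.
    Leaves (words) are kept as plain strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()

def leaves(tree):
    """Return the words under a subtree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in leaves(child)]

def phrases(tree, label):
    """Return the word string of every subtree carrying the given label."""
    if isinstance(tree, str):
        return []
    found = [" ".join(leaves(tree))] if tree[0] == label else []
    for child in tree[1]:
        found.extend(phrases(child, label))
    return found

PARSE = """(ROOT (S (NP (DT The) (NML (NN income) (NN tax)) (NN proposal))
  (VP (VBD was) (VP (VBN rejected))) (. .)))"""

tree = parse_tree(PARSE)
print(phrases(tree, "NML"))  # ['income tax']
print(phrases(tree, "NP"))   # ['The income tax proposal']
```

Running the same helper over the parse of "fish tank water" would show directly whether the model produced an NML node there or a flat NP.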

Related

Why does MaxentTagger tag numbers as NN sometimes?

I am trying to tag an HTML page full of space-separated numbers like "5320412185 5320412184 5320412189..." to observe how the tagger behaves with numbers. I'm using english-left3words-distsim.tagger in the constructor. On the console I see that most of the numbers are tagged as CD, but at times some numbers are tagged as NN. I searched the FAQ page of nlp.stanford.edu but couldn't find this there. Can anyone help me understand this?
I don't know if I need to mention this: I'm feeding each number separately to the tagger by splitting the huge input (1045000 numbers!) on the space delimiter.
From Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision)
Sometimes, it is unclear whether one is cardinal number or a noun. In general, it should be tagged as a
cardinal number (CD) even when its sense is not clearly that of a numeral.
EXAMPLE: one/CD of the best reasons
But if it could be pluralized or modified by an adjective in a particular context, it is a common noun (NN).
EXAMPLE: the only (good) one/NN of its kind
(cf. the only (good) ones/NNS of their kind)
In the collocation another one, one should also be tagged as a common noun (NN).
Hyphenated fractions one-half, three-fourths, seven-eighths, one-and-a-half, seven-and-three-eighths should
be tagged as adjectives (JJ) when they are prenominal modifiers, but as adverbs (RB) if they could be
replaced by double or twice.
For further reading: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1603&context=cis_reports
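If the goal is simply to have every all-digit token come out as CD, one possible workaround is to post-process the tagger output. This is only a sketch under that assumption: force_cd is a hypothetical helper, not part of MaxentTagger, and it expects the tagged output to have already been converted to (word, tag) pairs.

```python
import re

def force_cd(tagged_tokens):
    """Re-tag purely numeric tokens as CD and leave every other tag alone."""
    return [
        (word, "CD" if re.fullmatch(r"\d+", word) else tag)
        for word, tag in tagged_tokens
    ]

print(force_cd([("5320412185", "NN"), ("5320412184", "CD"), ("one", "NN")]))
```

Words such as "one" are deliberately left untouched, since the guidelines quoted above allow them to be NN in some contexts.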

Stanford NLP. The Solution to Draw Graphical Tree

I have a tree. I want to draw it in graphical form. After that, I would like to extend the drawing so that nodes can be added or deleted and the POS tags of the tree can be edited directly in the graphic. Can you give me some ideas to start with? Sorry for my bad English.
example tree:
(ROOT
  (S (NP (NNP John))
    (VP (VBZ loves)
      (NP (NNP Mary)))
    (. .)))
You can take a look at PennTreeReader in CoreNLP for code to read the string form of a tree into a Tree object. Beyond that, the design of the visualization is up to you. A good place to start might be D3; there's even a CoreNLP Parse Tree demo.
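If the drawing is done outside Java, the bracketed string first has to be turned into nested data. Below is a rough sketch in Python that converts the example tree into nested name/children dictionaries, a shape that tree-layout code such as D3's hierarchy helpers commonly consumes. The function name to_hierarchy is made up for illustration; this is not a CoreNLP API.

```python
import json
import re

def to_hierarchy(bracketed):
    """Convert a bracketed parse into nested {"name", "children"} dicts."""
    root = {"name": None, "children": []}
    stack = [root]
    expect_label = False
    for tok in re.findall(r"\(|\)|[^\s()]+", bracketed):
        if tok == "(":
            # Open a new node under the current one; its label comes next.
            node = {"name": None, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
            expect_label = True
        elif tok == ")":
            stack.pop()
        elif expect_label:
            stack[-1]["name"] = tok
            expect_label = False
        else:
            # A word: store it as a childless node under its POS tag.
            stack[-1]["children"].append({"name": tok, "children": []})
    return root["children"][0]

TREE = "(ROOT (S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .)))"
hierarchy = to_hierarchy(TREE)
print(json.dumps(hierarchy, indent=2))
```

Editing a POS tag then amounts to changing a node's "name", and adding or deleting a node amounts to changing a "children" list before redrawing.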

algorithm to get topic / focus of sentence out of words in sentence

Are there any well-known or successful algorithms for obtaining the topic and/or focus of a sentence (question) from the words in the question?
If not, how would I go about getting the topic/focus of the question? It seems that the topic/focus of a question is usually a noun or a noun phrase.
So the first thing I would do is determine the nouns by part-of-speech tagging the question. But then how do I know whether I should get just the nouns, or the noun(s) and an adjective before them, or the noun and the adverb before it, or the noun(s) and verb?
For example:
In 'did the quick brown fox jump over the lazy dog', get 'quick brown fox', 'jump', and 'lazy dog'.
In 'what is the population of japan', get 'population' and 'japan'.
In 'what color is milk', get 'color' and 'milk'.
In 'What is the height of Mt. Everest', get 'Mt. Everest' and 'Height'.
While writing these I guess the easiest way is removing stop words.
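The stop-word idea can be sketched in a few lines of Python. The STOP_WORDS set below is a tiny hand-picked list chosen only to cover the examples above, not a standard list, and keywords is a made-up helper name.

```python
# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"did", "the", "over", "what", "is", "of", "a", "an", "in", "to"}

def keywords(question):
    """Lower-case the question and drop every stop word."""
    return [w for w in question.lower().split() if w not in STOP_WORDS]

print(keywords("what is the population of japan"))  # ['population', 'japan']
print(keywords("what color is milk"))               # ['color', 'milk']
print(keywords("did the quick brown fox jump over the lazy dog"))
```

Note that this returns single words, so 'quick brown fox' comes back as three separate keywords; grouping them into phrases needs tagging or parsing.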
I think first of all that the problem is language-dependent.
Secondly, I think that if you have a set of words, you could check their popularity/frequency in the language; e.g. the word "the" occurs much more often than the word "euphoric", so "euphoric" has more chance of being a proper keyword.
Here, however, correct spelling is crucial. How to deal with this? One idea is to apply distance algorithms such as Levenshtein to words that do not occur often (or do a Google search with the word and check whether you get results or a "did you mean" notification).
Some languages are, though, more structured than others. In English, to find nouns, you can first check for "a/an word" and then for words that end in "s" to find possible candidates for nouns. Then compare them with a dictionary.
With adjectives you can perhaps assume that a possible adjective will be located right before the noun. Then just compare the possible adjective with the dictionary.
Then you could of course keep a black-list of words that are never allowed as keywords.
The best solution would perhaps be to have a self-learning neural system, but I'm not familiar enough with those to give any suggestions.
This could be thought of as a parsing problem, and I personally find the Stanford NLP tools very effective.
Here is the link to the demo of the Stanford parser
For the example "did the quick brown fox jump over the lazy dog",
the output you get is
did/VBD
the/DT
quick/JJ
brown/JJ
fox/NN
jump/VB
over/RP
the/DT
lazy/JJ
dog/NN
From the output you can write an extractor to extract the nouns (and adjectives and adverbs if need be) and thus obtain the topics from the sentence.
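Such an extractor could look roughly like the sketch below. It assumes the tagger output has been joined into a single word/TAG string, and it groups any run of adjectives (JJ*) followed by nouns (NN*) into one phrase; noun_chunks is a made-up helper name and this is only one simple heuristic.

```python
TAGGED = ("did/VBD the/DT quick/JJ brown/JJ fox/NN jump/VB "
          "over/RP the/DT lazy/JJ dog/NN")

def noun_chunks(tagged):
    """Collect runs of adjectives followed by one or more nouns."""
    chunks, current, has_noun = [], [], False
    for item in tagged.split():
        word, tag = item.rsplit("/", 1)
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag.startswith("JJ") and not has_noun:
            current.append(word)
        else:
            # Any other tag ends the current run; keep it only if it has a noun.
            if has_noun:
                chunks.append(" ".join(current))
            current, has_noun = [], False
            if tag.startswith("JJ"):
                current.append(word)
    if has_noun:
        chunks.append(" ".join(current))
    return chunks

print(noun_chunks(TAGGED))  # ['quick brown fox', 'lazy dog']
```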
Moreover, the parse tree looks like
(ROOT
  (SINV (VBD did)
    (NP (DT the) (JJ quick) (JJ brown) (NN fox))
    (VP (VB jump)
      (PRT (RP over))
      (NP (DT the) (JJ lazy) (NN dog)))))
If you take a closer look at the parse tree, the outputs you are expecting are both NPs (noun phrases): "the quick brown fox" and "the lazy dog".
I hope this helps !

Solving Caliban problems with prolog

I'm working on solving a logic puzzle using Prolog for school. Here are the clues:
Brown, Clark, Jones and Smith are 4 substantial citizens who serve their
community as architect, banker, doctor and lawyer, though not necessarily
respectively.
Brown, who is more conservative than Jones but more liberal than Smith,
is a better golfer than the men who are younger than he is and has a
larger income than the men who are older than Clark.
The banker, who earns more than the architect, is neither the youngest
nor the oldest.
The doctor, who is a poorer golfer than the lawyer, is less conservative
than the architect.
As might be expected, the oldest man is the most conservative and has the
largest income, and the youngest man is the best golfer.
What is each man's profession?
Hint: to rank people for wealth, ability, relative age, etc.,
use the numbers 1,2,3,4. Be careful to state whether 1 represents,
e.g., youngest or oldest. Doing this makes comparisons easy to code.
The code below encodes all the relationships given by the clues as a list of lists, wherein each inner list has the form
%[profession,surname,politics,relative_age, relative_salary, golf_ability]:
profession(L) :- L = [[_,'Brown',_,_,_,_],[_,'Jones',_,_,_,_],[_,'Clark',_,_,_,_],
[_,'Smith',_,_,_,_]],
member([_,'Brown',P1,A6,M3,G3],L),
member([_,'Jones',P2,_,_,_],L),
member([_,'Clark',_,A3,_,_],L),
member([_,'Smith',P3,_,_,_],L),
moreconservative(P1,P2),
moreliberal(P1,P3),
bettergolfer(G3,younger(_,A6)),
richer(M3,older(_,A3)),
member(['banker',_,_,A1,M1,_],L),
member(['architect',_,P5,_,M2,_],L),
richer(M1,M2),
(A1 = 2;A1 = 3),
member(['doctor',_,P4,_,_,G1],L),
member(['lawyer',_,_,_,_,G2],L),
worsegolfer(G1,G2),
moreliberal(P4,P5),
member([_,_,4,4,4,_],L),
member([_,_,_,1,_,4],L).
I define the relative_politics,relative_salary,relative_age, and golf_ability relationships like so
EG:
richer(4,1).
moreconservative(4,1).
poorer(1,4).
poorer(1,3).
And it goes on for all relationships.
I think I have faithfully translated all of the clues to Prolog, but it just says fail when I query the database. EG:
?- profession(L).
fail.
I am using NU Prolog. I'm wondering if I made an error in my translation of the clues or I omitted a fact that is needed for the database to satisfy all the conditions of the list L.
bettergolfer(G3,younger(_,A6)) does not work this way in Prolog: younger(_,A6) is passed as a plain term and is never called as a goal. Instead, have this
( member( X,L), age(X,AX), golf(X,GX),
( younger(AX,A6) -> better_golfer(G3,GX) ; true )),
.....
age( [_,_,_,A,_,_],A).
golf([_,_,_,_,_,G],G).
.....
This means that all the persons (including none) who are younger than Brown must be poorer golfers than he is.
There is a catch here, too. Since we're told about the men younger than Brown, it means there must exist at least one such man (unlike in the mathematical definition of implication). We have to code this too. For example,
( member(X,L), age(X,AX), younger(AX,A6) -> true ),
.....
(using unique names for the new logvars of course). You'll have to make the same transformation for your richer(M3,older(_,A3)).
Great idea BTW, to have the comparison predicates defined in a generative fashion:
poorer(1,2).
poorer(1,3).
poorer(1,4).
poorer(2,3).
poorer(2,4).
poorer(3,4).
richer(A,B):- poorer(B,A).
If you were to define them as arithmetic comparisons, e.g. poorer(A,B):- A<B, you could potentially run into problems with uninstantiated variables (as recently discussed here).
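As a cross-check on the encoding of the clues, here is a brute-force search written in Python rather than Prolog. It is only a sketch under stated assumptions: 4 always means "most" (most conservative, oldest, richest, best golfer), and the two "the men who are ..." clues are read as purely universal conditions, so they hold trivially when no such man exists. The require_existence switch is my own addition; it lets you test the stricter reading in which at least one such man must exist.

```python
from itertools import permutations

MEN = ("Brown", "Clark", "Jones", "Smith")
JOBS = ("architect", "banker", "doctor", "lawyer")

def rankings():
    """Yield every way to give the four men the distinct ranks 1..4."""
    for perm in permutations((1, 2, 3, 4)):
        yield dict(zip(MEN, perm))

def solve(require_existence=False):
    """Return the set of profession assignments consistent with the clues."""
    solutions = set()
    for cons in rankings():                      # 4 = most conservative
        if not (cons["Jones"] < cons["Brown"] < cons["Smith"]):
            continue
        for age in rankings():                   # 4 = oldest
            oldest = max(MEN, key=age.get)
            youngest = min(MEN, key=age.get)
            if cons[oldest] != 4:
                continue
            younger_than_brown = [m for m in MEN if age[m] < age["Brown"]]
            older_than_clark = [m for m in MEN if age[m] > age["Clark"]]
            if require_existence and not (younger_than_brown and older_than_clark):
                continue
            for income in rankings():            # 4 = largest income
                if income[oldest] != 4:
                    continue
                if not all(income["Brown"] > income[m] for m in older_than_clark):
                    continue
                for golf in rankings():          # 4 = best golfer
                    if golf[youngest] != 4:
                        continue
                    if not all(golf["Brown"] > golf[m] for m in younger_than_brown):
                        continue
                    for assignment in permutations(MEN):
                        who = dict(zip(JOBS, assignment))
                        if (income[who["banker"]] > income[who["architect"]]
                                and age[who["banker"]] in (2, 3)
                                and golf[who["doctor"]] < golf[who["lawyer"]]
                                and cons[who["doctor"]] < cons[who["architect"]]):
                            solutions.add(tuple(sorted(
                                (man, job) for job, man in who.items())))
    return solutions

print(solve())
print(solve(require_existence=True))
```

Under the universal reading the search finds a single assignment of professions. With require_existence=True it finds none, because the youngest man is the best golfer and the oldest has the largest income, so the requirement that a man younger than Brown (or older than Clark) must exist may be worth revisiting for this particular puzzle.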

Do minor currency units have an ISO standard?

ISO 4217 defines 3-letter currency symbols:
EUR
USD
LKR
GBP
Do currencies' minor units (cent, pence) have an ISO or similar standard, too, that defines codes for those sub-units like
ct
p
?
The standard also defines the relationship between the major currency unit and any minor currency unit. Often, the minor currency unit has a value that is 1/100 of the major unit, but 1/1000 is also common. Some currencies do not have any minor currency unit at all. In others, the major currency unit has so little value that the minor unit is no longer generally used (e.g. the Japanese sen, 1/100th of a yen). This is indicated in the standard by the currency exponent. For example, USD has exponent 2, while JPY has exponent 0. Mauritania does not use a decimal division of units, setting 1 ouguiya (UM) = 5 khoums, and Madagascar has 1 ariary = 5 iraimbilanja.
Wikipedia.
As for a better word, how does minor currency unit suit? Although, Wikipedia also refers to it as sub unit. Take your pick.
There is a table on that Wikipedia article listing the standard precision for the minor currency unit.
As a sidenote, Wikipedia provides the fractional unit name for all circulating currencies.
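The currency exponent described in the quote is what code typically uses to convert between amounts in minor and major units. A minimal sketch in Python, using only the two exponents mentioned above (USD 2, JPY 0); the table and the function names to_major and to_minor are illustrative, not taken from any library.

```python
from decimal import Decimal

# Exponents as quoted above; extend from the ISO 4217 table as needed.
EXPONENTS = {"USD": 2, "JPY": 0}

def to_major(minor_amount, code):
    """Convert an integer count of minor units to an amount in major units."""
    return Decimal(minor_amount).scaleb(-EXPONENTS[code])

def to_minor(major_amount, code):
    """Convert an amount in major units to an integer count of minor units."""
    return int(Decimal(major_amount).scaleb(EXPONENTS[code]))

print(to_major(1999, "USD"))     # 19.99
print(to_major(500, "JPY"))      # 500
print(to_minor("19.99", "USD"))  # 1999
```

A power-of-ten exponent cannot describe non-decimal subdivisions such as 1 ouguiya = 5 khoums, so those need a separate ratio.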
You need to look at the standard itself.
From the ISO website:
ISO 4217:2008 specifies the structure
for a three-letter alphabetic code and
an equivalent three-digit numeric code
for the representation of currencies
and funds. For those currencies having
minor units, it also shows the decimal
relationship between such units and
the currency itself.
ISO 4217:2008 also establishes
procedures for a Maintenance Agency,
and specifies the method of
application for codes.
The key bit is:
it also shows the decimal
relationship between such units and
the currency itself.
So to answer your question, I couldn't find an ISO Standard that discusses minor units. Similar standards discuss Commercial Administration and Finance.
In the financial markets there are roughly two established industry standards.
The first one is really a case-by-case agreement, mostly enforced by exchanges whose securities are quoted in the minor currency unit.
This led to:
GBX for British pence
ZAC for South-African cents
ILA for Israeli agorot
Probably pioneered by Reuters and Bloomberg, the second standard is far more widespread and consistent. The agreement is to lowercase the third letter of the ISO 4217 code to denote the minor unit:
GBp, ZAr, ILs, USd, EUr, etc.
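Both conventions are easy to express in code. The sketch below, in Python, uses a lookup table for the exchange-style codes listed above and a one-line rule for the lowercase-third-letter convention; the names EXCHANGE_MINOR_CODES and lowercase_minor_code are made up for illustration.

```python
# Case-by-case exchange codes from the list above, keyed by ISO 4217 code.
EXCHANGE_MINOR_CODES = {"GBP": "GBX", "ZAR": "ZAC", "ILS": "ILA"}

def lowercase_minor_code(iso_code):
    """Apply the lowercase-third-letter convention, e.g. GBP -> GBp."""
    if len(iso_code) != 3:
        raise ValueError("expected a three-letter ISO 4217 code")
    return iso_code[:2] + iso_code[2].lower()

for code in ("GBP", "ZAR", "ILS", "USD", "EUR"):
    print(code, lowercase_minor_code(code), EXCHANGE_MINOR_CODES.get(code))
```

Because the second convention differs from the ISO code only by case, any system that handles these codes has to compare them case-sensitively.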
Related discussions:
http://www.fixtradingcommunity.org/pg/discussions/topicpost/167427/
