I would like to run Stanford neural dependency parser which has very impressive performance like 92.0% UAS, 89.7% LAS (Chen & Manning, 2014). I tried to follow their instructions but got sad numbers: 66.2% UAS, 62.0% LAS. Could somebody please tell me what I did wrong?
The commands:
PENN_TEST_PATH="test.mrg"
CONLL_TEST_PATH="$PENN_TEST_PATH.dep"
cat penntree/23/* > $PENN_TEST_PATH
java -cp stanford-parser-full-2014-10-31/stanford-parser.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -originalDependencies -conllx -treeFile $PENN_TEST_PATH > $CONLL_TEST_PATH
java -cp stanford-parser-full-2014-10-31/stanford-parser.jar edu.stanford.nlp.parser.nndep.DependencyParser -model stanford-parser-full-2014-10-31/PTB_Stanford_params.txt.gz -testFile $CONLL_TEST_PATH
Output:
Loading depparse model file: stanford-parser-full-2014-10-31/PTB_Stanford_params.txt.gz ...
dict=44392
pos=48
label=46
embeddingSize=50
hiddenSize=200
numTokens=48
preComputed=422468
###################
#Transitions: 91
#Labels: 45
ROOTLABEL: root
PreComputed 100000, Elapsed Time: 1.789 (s)
Initializing dependency parser done [2.6 sec].
Test File: test.mrg.dep
UAS = 66.2110
LAS = 62.0160
DependencyParser tagged 56684 words in 2416 sentences in 3.4s at 16559.7 w/s, 705.8 sent/s.
References
Chen, D., & Manning, C. (2014). A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 740–750). Doha, Qatar: Association for Computational Linguistics.
I found the problem. I need to call edu.stanford.nlp.trees.EnglishGrammaticalStructure with -basic option.
Related
I want to run a model in Mallet and need the topic-docs output, which gives the most prominent documents for each topic. This is necessary for interpreting the less clear topics correctly. But Mallet keeps on giving me empty txt files.
This is the command I use:
bin\mallet train-topics --input cleandata1000.mallet --num-topics 250 --num-iterations 3000 --optimize-interval 50 --optimize-burn-in 50 --output-topic-keys 1000-300-3000-50-topic-keys.txt --output-topic-docs 1000-300-1000-50-topic-docs.txt --num-top-docs 20 --output-doc-topics 1000-300-1000-50-doc-topics.txt --doc-topics-threshold 0.01 --xml-topic-phrase-report 1000-300-1000-50-topic-phrase.xml --output-state 1000-300-1000-50-state.gz --use-symmetric-alpha true
Does anyone know what the cause could be?
Edit in response to David Mimno's 4 Nov comment:
The same thing happens with different data (where the docs have a different lenght).
I just ran some other models with Mallet's test data. Peculiar: This trial gave no output at all (so the "en-topic-docs.txt" did not get made).
bin\mallet train-topics --input en.mallet --num-topics 5 --output-topic-docs en-topic-docs.txt
When I ask for the topic keys as output, both files are made, but the en-topic-docs.txt is empty.
bin\mallet train-topics --input en.mallet --num-topics 5 --output-topic-keys en-topic-keys.txt --output-topic-docs en-topic-docs.txt
My bad: there is a recurring error message:
Exception in thread "main" java.lang.ClassCastException: class java.net.URI cannot be cast to class java.lang.String (java.net.URI and java.lang.String are in module java.base of loader 'bootstrap')
at cc.mallet.topics.ParallelTopicModel.printTopicDocuments(ParallelTopicModel.java:1773)
at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:281)
I don't know what this might mean.
Thank you for any help, you are saving my PhD :)
I was able to fix this by using the latest release on github (202108) instead of MALLET 2.0.8. Now it works like a charm.
Instructions for using the developmental release: http://mallet.cs.umass.edu/download.php
Thank you for the pointers, David Mimno!
I am trying to produce a algorithm in Latex but keep getting the same error: ! LaTeX Error: File 'float.sty' not found.. Even if i recreate examples in new documents.
The problem occurs when I use the package algorithm which should allow me to create a algorithm environment. The log file indicates that LaTeX couldn't find the float.sty.
A simple solution would been found by just simply adding \usepackage{float} to the preamble. But adding the package leads to an error on the line \usepackage{algorithm}.
Here is the example code:
\documentclass{article}
\usepackage{algpseudocode,algorithm,algorithmicx}
\newcommand*\DNA{\textsc{dna}}
\newcommand*\Let[2]{\State #1 $\gets$ #2}
\algrenewcommand\algorithmicrequire{\textbf{Precondition:}}
\algrenewcommand\algorithmicensure{\textbf{Postcondition:}}
\begin{document}
\begin{algorithm}
\caption{Counting mismatches between two packed \DNA{} strings
\label{alg:packed-dna-hamming}}
\begin{algorithmic}[1]
\Require{$x$ and $y$ are packed \DNA{} strings of equal length $n$}
\Statex
\Function{Distance}{$x, y$}
\Let{$z$}{$x \oplus y$} \Comment{$\oplus$: bitwise exclusive-or}
\Let{$\delta$}{$0$}
\For{$i \gets 1 \textrm{ to } n$}
\If{$z_i \neq 0$}
\Let{$\delta$}{$\delta + 1$}
\EndIf
\EndFor
\State \Return{$\delta$}
\EndFunction
\end{algorithmic}
\end{algorithm}
With the help samcarter_is_at_toanswers.xyz provided, I obtained a bit more understanding in how LaTeX operates. The problem was that the MiKTeX 2.9 file for some reason didn't exist anymore (or at least couldn't be located by LaTeX or manually).
So I used this answered question on to guide me to reinstall MiKTeX. This resolved the problem. After the reinstall TeXmaker was able to download the float package with the float.sty which I needed to resolve original error caused by the algorithm package.
I've little to no experience with LaTeX/TeXmaker/MiKTeX and I'am not that technical with computers. So please let me know if you have a better answer/explanation/understanding of the problem. I will edit/remove my answer.
I am trying to analyse some French text with the Stanford CoreNLP tool (it's my first time trying to use any StanfordNLP software)
To do so, I have downloaded the v3.6.0 jar and the corresponding french models.
Then I run the server with:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
As described in this answer, I call the API with:
wget --post-data 'Bonjour le monde.' 'localhost:9000/?properties={"parse.model":"edu/stanford/nlp/models/parser/nndep/UD_French.gz", "annotators": "parse", "outputFormat": "json"}' -O -
but I get the following log + error:
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP
Adding annotator tokenize
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-1] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/parser/nndep/UD_French.gz ...
edu.stanford.nlp.io.RuntimeIOException: java.io.StreamCorruptedException: invalid stream header: 64696374
at edu.stanford.nlp.parser.common.ParserGrammar.loadModel(ParserGrammar.java:188)
at edu.stanford.nlp.pipeline.ParserAnnotator.loadModel(ParserAnnotator.java:212)
at edu.stanford.nlp.pipeline.ParserAnnotator.<init>(ParserAnnotator.java:115)
...
The solutions proposed here suggest the code and model version differs but I have dowloaded them from the same page (and they both have the same version number in their name) so I am pretty sure they are the same.
Any other hint on what I am doing wrong?
(I should also mention that I am not a Java expert, so maybe I forgot a stupid step... )
Ok, after a lot of readings and unsuccessful tries, I found a way to make it work (for v3.6.0). Here are the details, if they can be of any interest to someone else:
Dowload the code and french models from http://stanfordnlp.github.io/CoreNLP/index.html#download. Unzip the code .zip and copy the french model .jar to that directory (do not remove the english models, they have different names anyway)
cd to that directory and then run the server with:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
(it's a pity that the -prop flag doesn't help here)
Call the API repeating the properties listed in the StanfordCoreNLP-french.properties:
wget --header="Content-Type: text/plain; charset=UTF-8"
--post-data 'Bonjour le monde.'
'localhost:9000/?properties={
"annotators": "tokenize,ssplit,pos,parse",
"parse.model":"edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz",
"pos.model":"edu/stanford/nlp/models/pos-tagger/french/french.tagger",
"tokenize.language":"fr",
"outputFormat": "json"}'
-O -
which finally gives a 200 response using the French models!
(NB: don't know how to make it work with the UI (same for utf-8 support))
As a potentially useful addition for some, this is what the complete properties file for German looks like:
# annotators
annotators = tokenize, ssplit, mwt, pos, ner, depparse
# tokenize
tokenize.language = de
tokenize.postProcessor = edu.stanford.nlp.international.german.process.GermanTokenizerPostProcessor
# mwt
mwt.mappingFile = edu/stanford/nlp/models/mwt/german/german-mwt.tsv
# pos
pos.model = edu/stanford/nlp/models/pos-tagger/german-ud.tagger
# ner
ner.model = edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz
ner.applyNumericClassifiers = false
ner.applyFineGrained = false
ner.useSUTime = false
# parse
parse.model = edu/stanford/nlp/models/srparser/germanSR.beam.ser.gz
# depparse
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_German.gz
The complete property files for Arabic, Chinese, French, German and Spanish can all be found in the CoreNLP github repository.
Anyone know where the following files located:
trainFileList = /u/nlp/data/ner/column_data/muc6.ptb.train,
/u/nlp/data/ner/column_data/muc7.ptb.train
I am following the FAQ link http://nlp.stanford.edu/software/crf-faq.shtml#a
If all I need to do is provide a file with two columns consisting of tokens and class, then that will work. But I am curious about the train files listed in the classifier property files.
serializeTo = english.muc.7class.caseless.distsim.crf.ser.gz
java -mx1g -cp "$CLASSPATH" edu.stanford.nlp.ie.NERClassifierCombiner -textFile sample.txt -ner.model classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz -outputFormat tabbedEntities -textFile sample.txt > sample2.tsv
Those files are the training data for the MUC-6 and MUC-7 tasks:
http://cs.nyu.edu/faculty/grishman/muc6.html
They are not distributed by Stanford. I will see if I can figure out where they are distributed and update this answer.
UPDATE: LDC distributes those files if you want to get a copy, they have copyright issues so you have to purchase them from LDC, that is why we don't distribute them. Here are some links with more info:
http://www-nlpir.nist.gov/related_projects/muc/muc_data/muc_data_index.html
https://catalog.ldc.upenn.edu/LDC2003T13
https://catalog.ldc.upenn.edu/LDC2001T02
All -
Running Stanford CoreNLP 3.4.1, plus the Spanish models. I have a directory of approximately 100 Spanish raw text documents, UTF-8 encoded. For each one, I execute the following commandline:
java -cp stanford-corenlp-3.4.1.jar:stanford-spanish-corenlp-2014-08-26-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-0.23.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props <propsfile> -file <txtfile>
The props file looks like this:
annotators = tokenize, ssplit, pos
tokenize.language = es
pos.model = edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger
For almost every file, I get the following error:
Exception in thread "main" java.lang.RuntimeException: Error annotating :
at edu.stanford.nlp.pipeline.StanfordCoreNLP$15.run(StanfordCoreNLP.java:1287)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:1347)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.run(StanfordCoreNLP.java:1389)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1459)
Caused by: java.lang.NullPointerException
at edu.stanford.nlp.tagger.maxent.ExtractorSpanishStrippedVerb.extract(ExtractorFramesRare.java:1626)
at edu.stanford.nlp.tagger.maxent.Extractor.extract(Extractor.java:153)
at edu.stanford.nlp.tagger.maxent.TestSentence.getExactHistories(TestSentence.java:465)
at edu.stanford.nlp.tagger.maxent.TestSentence.getHistories(TestSentence.java:440)
at edu.stanford.nlp.tagger.maxent.TestSentence.getHistories(TestSentence.java:428)
at edu.stanford.nlp.tagger.maxent.TestSentence.getExactScores(TestSentence.java:377)
at edu.stanford.nlp.tagger.maxent.TestSentence.getScores(TestSentence.java:372)
at edu.stanford.nlp.tagger.maxent.TestSentence.scoresOf(TestSentence.java:713)
at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:91)
at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:31)
at edu.stanford.nlp.tagger.maxent.TestSentence.runTagInference(TestSentence.java:322)
at edu.stanford.nlp.tagger.maxent.TestSentence.testTagInference(TestSentence.java:312)
at edu.stanford.nlp.tagger.maxent.TestSentence.tagSentence(TestSentence.java:135)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.tagSentence(MaxentTagger.java:998)
at edu.stanford.nlp.pipeline.POSTaggerAnnotator.doOneSentence(POSTaggerAnnotator.java:147)
at edu.stanford.nlp.pipeline.POSTaggerAnnotator.annotate(POSTaggerAnnotator.java:110)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:67)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:847)
at edu.stanford.nlp.pipeline.StanfordCoreNLP$15.run(StanfordCoreNLP.java:1275)
Any ideas? I haven't even begun to track this down. I'm certain the problem is in POS; tokenize and ssplit run just fine.
P.S. Please don't say "Upgrade to 3.5.0"; I don't currently have Java 8 installed and don't want to install it yet.
Thanks in advance.
Yes, it seems like there's a bug in the 3.4.1 Spanish models.
The Spanish 3.5.0 models actually seem to be compatible with Java 7. You can download the models used in 3.5 (stanford-spanish-corenlp-2014-10-23-models.jar) and put that on your classpath instead. This fixed the problem for me running Java 7 locally.