When I try to build a PTB-format dataset with the following command, I get the message 'Unknown argument -model'. Why am I getting this message? Is there something wrong with the command?
java -cp "*" edu.stanford.nlp.sentiment.BuildBinarizedDataset -model edu/stanford/nlp/models/sentiment/sentiment.ser.gz -input train.txt
Use the -sentimentModel flag rather than -model.
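That is, the command from the question becomes:
java -cp "*" edu.stanford.nlp.sentiment.BuildBinarizedDataset -sentimentModel edu/stanford/nlp/models/sentiment/sentiment.ser.gz -input train.txt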
I am trying to extract the text from a sample PDF located here using Apache Tika.
Using the GUI of Apache Tika (run from the console with java -jar tika-app-1.22.jar) seems to produce plain text pretty well.
The issue arises when I try to extract the text by running the following command:
java -jar tika-app-1.22.jar --text lorem-ipsum.pdf
This produces text that contains the á character wherever whitespace is expected.
Any help configuring the command to produce normal whitespace is appreciated, thank you.
We are trying to extract the euro value from the document. Stanford CoreNLP recognizes the money amount as expected; however, during extraction it converts € to $.
Here is a sample command to run Stanford CoreNLP and turn off the currency normalization:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -file sample-sentence.txt -outputFormat text -tokenize.options "normalizeCurrency=false"
If you are running CoreNLP as a dedicated server, you can include the tokenize.options property in the URL when sending the request.
E.g.:
http://corenlp.run?properties={"timeout":"36000","annotators":"tokenize,ssplit,parse,lemma,ner,regexner","tokenize.options":"normalizeCurrency=false,invertible=true"}
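For example, against a local server (assuming the default port 9000 and a made-up input sentence), a request could look like this:
curl --data 'It costs €50.' 'http://localhost:9000/?properties={"annotators":"tokenize,ssplit","tokenize.options":"normalizeCurrency=false"}'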
I am trying to process many snippets of text using the Stanford parser. I am outputting to XML using this command:
java -cp stanford-corenlp-3.3.1.jar:stanford-corenlp-3.3.1-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse -file test
All I need is the sentence parse of each snippet. The problem is that a snippet can contain more than one sentence, and the output XML gives all the sentences together, so I can't tell which sentences belong to which snippet. I could add a separator word between snippets, but I think there must be a built-in capability to show the separation.
There is a parameter -fileList that takes a string of comma-separated files as its input.
Example:
java -cp stanford-corenlp-3.3.1.jar:stanford-corenlp-3.3.1-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse -fileList=file1.txt,file2.txt,file3.txt
Have a look at SentimentPipeline.java (in the edu.stanford.nlp.sentiment package) for further details.
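Alternatively, if you call CoreNLP from Java you can annotate each snippet separately, which keeps the snippet-to-sentence mapping explicit. A minimal sketch (assuming the CoreNLP 3.3.1 jars and models are on the classpath; the snippet strings are placeholders):

import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class SnippetParser {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, parse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    String[] snippets = { "First snippet. It has two sentences.", "Second snippet." };
    for (int i = 0; i < snippets.length; i++) {
      // one Annotation per snippet, so sentences can't leak across snippets
      Annotation doc = new Annotation(snippets[i]);
      pipeline.annotate(doc);
      List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);
      for (CoreMap sentence : sentences) {
        System.out.println("snippet " + i + ": "
            + sentence.get(TreeCoreAnnotations.TreeAnnotation.class));
      }
    }
  }
}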
Say I have a file on HDFS:
1
2
3
I want it transformed to
a
b
c
I wrote a mapper.py:
#!/usr/bin/python
import sys

# map each integer n from stdin to the n-th lowercase letter: 1 -> a, 2 -> b, ...
for line in sys.stdin:
    print chr(int(line) + ord('a') - 1)
then using the streaming api:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
-mapper mapper.py -file mapper.py -input /input -output /output
But the result in /output is "a\t\nb\t\nc\t\n":
a\t
b\t
c\t
Note the extra invisible tab characters; I wrote them as '\t' above. This is documented here:
If there is no tab character in the line, then entire line is considered as key and the value is null.
So the tabs were added by the streaming API as separators. But no matter how I modify the separator-related options, I can't make them disappear.
So my question is: is there a way to do this job cleanly, without extras like the tabs?
Or, to put it more clearly: is there a way to use Hadoop purely as a distributed filter, discarding its key/value mechanism?
====
Update 2013.11.27
After discussing with friends, I concluded there is no easy way to achieve this goal, so I worked around the problem by keeping the tab as the field separator in my output and setting tab as the field separator in Hive as well.
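On the Hive side that just means declaring tab as the field delimiter when creating the table; a minimal example (the table name and column are made up):
CREATE TABLE letters (letter STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';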
Some of my friends proposed using -D mapred.textoutputformat.ignoreseparator=true, but that parameter just doesn't work. I looked into this file:
hadoop-1.1.2/src/mapred/org/apache/hadoop/mapred/TextOutputFormat.java
and didn't find such an option. As an alternative, the streaming API accepts an -outputformat parameter that specifies a different output format.
According to this article, you can make a copy of TextOutputFormat.java, remove the default '\t', compile it, pack it into a jar, and call the streaming API with -libjars yourjar.jar -outputformat path.to.your.outputformat. But I didn't succeed this way with hadoop-1.1.2; I'm just writing this down for others' reference.
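For reference, here is a minimal sketch of what such a modified output format might look like against the old mapred API (the class and package names are made up, and as said above I didn't get this approach to work on hadoop-1.1.2):

package example; // hypothetical package

import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

// like TextOutputFormat, but writes only the key and no '\t' separator
public class NoTabTextOutputFormat<K, V> extends FileOutputFormat<K, V> {
  @Override
  public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
      String name, Progressable progress) throws IOException {
    Path file = FileOutputFormat.getTaskOutputPath(job, name);
    final DataOutputStream out = file.getFileSystem(job).create(file, progress);
    return new RecordWriter<K, V>() {
      public void write(K key, V value) throws IOException {
        // with no tab in the mapper's line, streaming puts the whole line in
        // the key and a null in the value, so writing just the key is enough
        if (key != null) {
          out.write(key.toString().getBytes("UTF-8"));
        }
        out.write('\n');
      }
      public void close(Reporter reporter) throws IOException {
        out.close();
      }
    };
  }
}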
You should be able to get rid of these delimiters by making your job map-only, which is essentially what you want for a distributed filter: the output of your mappers becomes the final output.
To do that in Hadoop streaming, you can use the following option:
-D mapred.reduce.tasks=0
So the full command would look something like this:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar -D mapred.reduce.tasks=0 -mapper mapper.py -file mapper.py -input /input -output /output
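If this works, the part files written by the mappers should now contain the bare letters; a quick check (assuming the default part file name):
hadoop fs -cat /output/part-00000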
I want to pass a filter statement into my Pig script using parameter substitution.
For that I have tried:
exec -param flt='a1==1 AND a2=2' filterscript.pig
But sadly it throws this error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 101: Local file 'AND' does not exist.
Pig version: 0.9.2
I have tried flt='\'a1==1 AND a2=2\'' and flt="a1==1 AND a2==2", as suggested by Pig users on the Apache forum, and I have also seen a similar post on SO.
Any help will be appreciated.
I think you are passing the parameter as-is as a condition; if so, you will get an error like this. Instead, you can pass the values as separate parameters and form the condition string inside the Pig script.
exec -p p1=1 -p p2=2 filterscript.pig
Inside your filterscript.pig script you can then use these parameter values in condition clauses, for example:
a1 == $p1 AND a2 == $p2
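A fuller sketch of such a filterscript.pig (the load statement and schema are made up):
A = LOAD 'input' AS (a1:int, a2:int);
B = FILTER A BY a1 == $p1 AND a2 == $p2;
DUMP B;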
If you run your script outside the Grunt shell, you can do the following:
pig -param flt="a1\=\=1 AND a2\=\=2" -f filterscript.pig
where filterscript.pig is something like this:
A = load ...
...
B = filter A by $flt;
...
Note that the '=' characters are also escaped; otherwise the filter condition won't be evaluated to a boolean.
If you want to use the filter substitution within the Grunt shell as you tried with exec, then you'll encounter the whitespace problem. Since escaping the whitespace character doesn't work, as a workaround you can create a parameter file:
cat params.txt
flt="a1\=\=1 AND a2\=\=2"
Then issue:
exec -param_file params.txt filterscript.pig
Note: I use Pig 0.12