StanfordCoreNLP - Setting pipelineLanguage to German not working?
I am using the pycorenlp client to talk to the Stanford CoreNLP server. In my setup I am setting pipelineLanguage to 'german' like this:
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')
text = 'Das große Auto.'
output = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit,pos,depparse,parse',
    'outputFormat': 'json',
    'pipelineLanguage': 'german'
})
However, judging from the output, this is not working:
output['sentences'][0]['tokens']
will return:
[{'after': ' ',
'before': '',
'characterOffsetBegin': 0,
'characterOffsetEnd': 3,
'index': 1,
'originalText': 'Das',
'pos': 'NN',
'word': 'Das'},
{'after': ' ',
'before': ' ',
'characterOffsetBegin': 4,
'characterOffsetEnd': 9,
'index': 2,
'originalText': 'große',
'pos': 'NN',
'word': 'große'},
{'after': '',
'before': ' ',
'characterOffsetBegin': 10,
'characterOffsetEnd': 14,
'index': 3,
'originalText': 'Auto',
'pos': 'NN',
'word': 'Auto'},
{'after': '',
'before': '',
'characterOffsetBegin': 14,
'characterOffsetEnd': 15,
'index': 4,
'originalText': '.',
'pos': '.',
'word': '.'}]
The expected tagging should look more like:
Das große Auto
POS: DT JJ NN
It seems to me that setting 'pipelineLanguage' to 'german' is not taking effect for some reason.
I've executed
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
in order to start the server.
I am getting the following from the logger:
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
[pool-1-thread-3] ERROR CoreNLP - Failure to load language specific properties: StanfordCoreNLP-german.properties for german
[pool-1-thread-3] INFO CoreNLP - [/127.0.0.1:60700] API call w/annotators tokenize,ssplit,pos,depparse,parse
Das große Auto.
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-3] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.5 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ...
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 99996, Elapsed Time: 8.645 (s)
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [9.8 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.3 sec].
Apparently the server is loading the English models; the only hint is the ERROR line above about failing to load StanfordCoreNLP-german.properties.
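(Side note: once the German models jar is on the server's classpath, it should also be possible to request the German models explicitly via properties instead of relying on pipelineLanguage. This is only a sketch; the property names follow the usual CoreNLP conventions and the model paths are the ones that show up in the German log further down, so treat them as assumptions:)
output = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit,pos,depparse,parse',
    'outputFormat': 'json',
    # assumed property names/paths; requires the German models jar on the server's classpath
    'tokenize.language': 'de',
    'pos.model': 'edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger',
    'depparse.model': 'edu/stanford/nlp/models/parser/nndep/UD_German.gz',
    'parse.model': 'edu/stanford/nlp/models/lexparser/germanFactored.ser.gz'
})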
Alright, I just downloaded the German models jar from the website and moved it into the directory where I extracted the server, i.e.
~/Downloads/stanford-corenlp-full-2017-06-09
After restarting the server, the German models were loaded successfully:
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-3] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger ... done [5.1 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model file: edu/stanford/nlp/models/parser/nndep/UD_German.gz ...
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 99984, Elapsed Time: 11.419 (s)
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [12.2 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/germanFactored.ser.gz ... done [1.0 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[pool-1-thread-3] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/german.conll.hgc_175m_600.crf.ser.gz ... done [0.7 sec].
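With the German models jar in place, re-running the original request should now produce German part-of-speech tags. A quick sanity check (just a sketch; note that the German tagger uses the STTS tagset, so the tags should look more like ART/ADJA/NN than the Penn-style DT/JJ/NN):
output = nlp.annotate('Das große Auto.', properties={
    'annotators': 'tokenize,ssplit,pos,depparse,parse',
    'outputFormat': 'json',
    'pipelineLanguage': 'german'
})
# print word/POS pairs; with the German models loaded these should no longer all be 'NN'
print([(t['word'], t['pos']) for t in output['sentences'][0]['tokens']])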
Related
Getting error 'NoneType' object has no attribute 'dumps' when loading a model in Haystack
I trying to load 'bert-base-multilingual-uncased' in haystack FARMReader and get the error: (huyenv) PS D:\study\DUANCNTT2\HAYSTACK\haystack_demo> & d:/study/DUANCNTT2/HAYSTACK/haystack_demo/huyenv/Scripts/python.exe d:/study/DUANCNTT2/HAYSTACK/haystack_demo/main.py 05/21/2021 00:12:58 INFO - faiss.loader - Loading faiss. 05/21/2021 00:12:58 - INFO - faiss.loader - Loading faiss. 05/21/2021 00:12:59 - INFO - farm.modeling.prediction_head - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex . 05/21/2021 00:13:00 - INFO - faiss.loader - Loading faiss. 05/21/2021 00:13:00 INFO - faiss.loader - Loading faiss. 05/21/2021 00:13:01 - INFO - elasticsearch - HEAD http://localhost:9200/ [status:200 request:0.018s] 05/21/2021 00:13:01 - INFO - elasticsearch - HEAD http://localhost:9200/cv [status:200 request:0.005s] 05/21/2021 00:13:01 - INFO - elasticsearch - GET http://localhost:9200/cv [status:200 request:0.009s] 05/21/2021 00:13:01 - INFO - elasticsearch PUT http://localhost:9200/cv/_mapping [status:200 request:0.041s] 05/21/2021 00:13:01 - INFO - elasticsearch - HEAD http://localhost:9200/label [status:200 request:0.008s] 05/21/2021 00:13:01 - INFO - farm.utils - Using device: CPU 05/21/2021 00:13:01 INFO - farm.utils - Number of GPUs: 0 05/21/2021 00:13:01 - INFO - farm.utils - Distributed Training: False 05/21/2021 00:13:01 - INFO farm.utils - Automatic Mixed Precision: None Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias'] This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-multilingual-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 05/21/2021 00:13:21 - WARNING - farm.utils - ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow. 05/21/2021 00:13:21 - INFO - farm.utils - Using device: CPU 05/21/2021 00:13:21 - INFO - farm.utils - Number of GPUs: 0 05/21/2021 00:13:21 - INFO - farm.utils - Distributed Training: False 05/21/2021 00:13:21 - INFO farm.utils - Automatic Mixed Precision: None 05/21/2021 00:13:21 - INFO - farm.infer - Got ya 3 parallel workers to do inference ... 
05/21/2021 00:13:21 - INFO - farm.infer - 0 0 0 05/21/2021 00:13:21 - INFO - farm.infer - /w\ /w\ /w\ 05/21/2021 00:13:21 - INFO - farm.infer - /'\ / \ /'\ 05/21/2021 00:13:21 - INFO - farm.infer - Exception ignored in: <function Pool.del at 0x000001BBA1DC9C10> Traceback (most recent call last): File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\pool.py", line 268, in del File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\queues.py", line 362, in put AttributeError: 'NoneType' object has no attribute 'dumps' This is my main.py file: from haystack.reader.farm import FARMReader from haystack.document_store.elasticsearch import ElasticsearchDocumentStore from haystack.retriever.sparse import ElasticsearchRetriever document_store = ElasticsearchDocumentStore( host="localhost", username="", password="", index="cv", embedding_dim=768, embedding_field="embedding") retriever = ElasticsearchRetriever(document_store=document_store) reader = FARMReader(model_name_or_path='bert-base-multilingual-uncased') NOTICE: My elasticsearch server has been started successfully!
Seems like an issue with multiprocessing on Windows. You can disable multiprocessing for the FARMReader like this:
...
reader = FARMReader(model_name_or_path='bert-base-multilingual-uncased', num_processes=0)
See also the docs for more details.
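For reference, a minimal version of the main.py from the question with this workaround applied might look like the following (a sketch; the connection settings are the ones shown above, and num_processes=0 simply disables the multiprocessing pool):
from haystack.reader.farm import FARMReader
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.sparse import ElasticsearchRetriever

document_store = ElasticsearchDocumentStore(
    host="localhost", username="", password="", index="cv",
    embedding_dim=768, embedding_field="embedding")
retriever = ElasticsearchRetriever(document_store=document_store)
# num_processes=0 disables multiprocessing and avoids the Windows pool error above
reader = FARMReader(model_name_or_path='bert-base-multilingual-uncased',
                    num_processes=0)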
Running Sqoop with Oozie Error: Can not create a Path from an empty string
I am trying to Run Sqoop export with Oozie. I can run simple Sqoop commands (list-tables etc) and I can run my Sqoop export command from the cmd line, however when I run with Oozie I get the following error in my Yarn logs: Error: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/mnt/resource/hadoop/yarn/local/filecache/41235/mapreduce.tar.gz/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/mnt/resource/hadoop/yarn/local/filecache/41709/slf4j-log4j12-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Note: /tmp/sqoop-yarn/compile/ff5ff27843de6fb697dddfb18c85dbbb/tmp_fact_kpi_da20.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:126) at org.apache.hadoop.fs.Path.<init>(Path.java:134) at org.apache.hadoop.mapreduce.JobResourceUploader.uploadFiles(JobResourceUploader.java:127) at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:95) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:190) at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290) at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308) at org.apache.sqoop.mapreduce.ExportJobBase.doSubmitJob(ExportJobBase.java:326) at org.apache.sqoop.mapreduce.ExportJobBase.runJob(ExportJobBase.java:303) at org.apache.sqoop.mapreduce.ExportJobBase.runExport(ExportJobBase.java:444) at org.apache.sqoop.manager.SQLServerManager.exportTable(SQLServerManager.java:192) at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:81) at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:100) at org.apache.sqoop.Sqoop.run(Sqoop.java:147) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:225) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234) at org.apache.sqoop.Sqoop.main(Sqoop.java:243) at org.apache.oozie.action.hadoop.SqoopMain.runSqoopJob(SqoopMain.java:197) at org.apache.oozie.action.hadoop.SqoopMain.run(SqoopMain.java:179) at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:58) at org.apache.oozie.action.hadoop.SqoopMain.main(SqoopMain.java:48) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:239) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170) at java.security.AccessController.doPrivileged(Native 
Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164) My workflow.xml is: <workflow-app name="${jobName}" xmlns="uri:oozie:workflow:0.1"> <start to="sqoop-export" /> <action name="sqoop-export"> <sqoop xmlns="uri:oozie:sqoop-action:0.2"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <configuration> <property> <name>mapred.compress.map.output</name> <value>true</value> </property> <property> <name>oozie.action.sharelib.for.sqoop</name> <value>sqoop,hive,hcatalog</value> </property> <property> <name>oozie.sqoop.log.level</name> <value>${debugLevel}</value> </property> <property> <name>mapred.reduce.tasks</name> <value>1</value> </property> <property> <name>hive.metastore.uris</name> <value>thrift://****:9083</value> </property> <property> <name>hive.metastore.warehouse.dir</name> <value>/apps/hive/warehouse</value> </property> </configuration> <command>export --hcatalog-database modeling_reporting --hcatalog-table fact_kpi_da20 --table tmp_fact_kpi_da20 --connect jdbc:sqlserver://****.database.windows.net:1433;databaseName=****;user=****;password=**** </command> </sqoop> <ok to="end"/> <error to="sqoop-load-fail"/> </action> <kill name="sqoop-load-fail"> <message>Sqoop export failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name="end" /> </workflow-app> And my job.properties includes: oozie.use.system.libpath=true oozie.wf.application.path=/user/abc I run the job with: oozie job -config job.properties -run Additional logs show the job is able to connect to my destination table and verifies that my columns match: 7222 [main] DEBUG org.apache.sqoop.orm.CompilationManager - Finished writing jar file /tmp/sqoop-yarn/compile/24e897ef3439fabb89090a4dbe4c9be1/tmp_fact_kpi_da20.jar 7222 [main] DEBUG org.apache.sqoop.orm.CompilationManager - Finished writing jar file /tmp/sqoop-yarn/compile/24e897ef3439fabb89090a4dbe4c9be1/tmp_fact_kpi_da20.jar 7235 [main] INFO org.apache.sqoop.mapreduce.ExportJobBase - Beginning export of tmp_fact_kpi_da20 7235 [main] INFO org.apache.sqoop.mapreduce.ExportJobBase - Beginning export of tmp_fact_kpi_da20 7235 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address 7240 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.jar is deprecated. Instead, use mapreduce.job.jar 7240 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.map.max.attempts is deprecated. 
Instead, use mapreduce.map.maxattempts 7240 [main] INFO org.apache.sqoop.mapreduce.ExportJobBase - Configuring HCatalog for export job 7240 [main] INFO org.apache.sqoop.mapreduce.ExportJobBase - Configuring HCatalog for export job 7257 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Configuring HCatalog specific details for job 7257 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Configuring HCatalog specific details for job 7493 [main] DEBUG org.apache.sqoop.manager.SqlManager - Execute getColumnInfoRawQuery : SELECT t.* FROM [tmp_fact_kpi_da20] AS t WHERE 1=0 7493 [main] DEBUG org.apache.sqoop.manager.SqlManager - Execute getColumnInfoRawQuery : SELECT t.* FROM [tmp_fact_kpi_da20] AS t WHERE 1=0 7493 [main] DEBUG org.apache.sqoop.manager.SqlManager - Using fetchSize for next query: 1000 7493 [main] DEBUG org.apache.sqoop.manager.SqlManager - Using fetchSize for next query: 1000 7493 [main] INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM [tmp_fact_kpi_da20] AS t WHERE 1=0 7493 [main] INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM [tmp_fact_kpi_da20] AS t WHERE 1=0 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column eventdate of type [-9, 10, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column eventdate of type [-9, 10, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column clientregion of type [-9, 4, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column clientregion of type [-9, 4, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column clientjourney of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column clientjourney of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column eventtype of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column eventtype of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column eventreason of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column eventreason of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column feature of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column feature of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column customerdevice of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column customerdevice of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column customerbrowser of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column customerbrowser of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column customercountryiso2 of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column customercountryiso2 of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column clientcurrencyiso3 of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column clientcurrencyiso3 of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column eventcount of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column eventcount of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column uniqueeventcount of type [-5, 19, 0] 7582 
[main] DEBUG org.apache.sqoop.manager.SqlManager - Found column uniqueeventcount of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column sales of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column sales of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column salesvalue of type [3, 38, 2] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column salesvalue of type [3, 38, 2] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column salesvaluegbp of type [3, 38, 2] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column salesvaluegbp of type [3, 38, 2] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column started_customersurveys of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column started_customersurveys of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column completed_customersurveys of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column completed_customersurveys of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column started_emailsubscriptions of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column started_emailsubscriptions of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column completed_emailsubscriptions of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column completed_emailsubscriptions of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column started_problemsolversurveys of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column started_problemsolversurveys of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column completed_problemsolversurveys of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column completed_problemsolversurveys of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column scenario of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column scenario of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column abtestgroup of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column abtestgroup of type [-9, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column abtestid of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column abtestid of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column abtestiscontrol of type [-7, 1, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column abtestiscontrol of type [-7, 1, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column appversion of type [12, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column appversion of type [12, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column agentid of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column agentid of type [-5, 19, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column pdate of type [12, 50, 0] 7582 [main] DEBUG org.apache.sqoop.manager.SqlManager - Found column pdate of type [12, 50, 0] 7670 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - 
Database column names projected : [eventdate, clientregion, clientjourney, eventtype, eventreason, feature, customerdevice, customerbrowser, customercountryiso2, clientcurrencyiso3, eventcount, uniqueeventcount, sales, salesvalue, salesvaluegbp, started_customersurveys, completed_customersurveys, started_emailsubscriptions, completed_emailsubscriptions, started_problemsolversurveys, completed_problemsolversurveys, scenario, abtestgroup, abtestid, abtestiscontrol, appversion, agentid, pdate] 7670 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Database column names projected : [eventdate, clientregion, clientjourney, eventtype, eventreason, feature, customerdevice, customerbrowser, customercountryiso2, clientcurrencyiso3, eventcount, uniqueeventcount, sales, salesvalue, salesvaluegbp, started_customersurveys, completed_customersurveys, started_emailsubscriptions, completed_emailsubscriptions, started_problemsolversurveys, completed_problemsolversurveys, scenario, abtestgroup, abtestid, abtestiscontrol, appversion, agentid, pdate] 7670 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Database column name - info map : started_customersurveys : [Type : -5,Precision : 19,Scale : 0] pdate : [Type : 12,Precision : 50,Scale : 0] uniqueeventcount : [Type : -5,Precision : 19,Scale : 0] sales : [Type : -5,Precision : 19,Scale : 0] customerbrowser : [Type : -9,Precision : 50,Scale : 0] salesvalue : [Type : 3,Precision : 38,Scale : 2] abtestiscontrol : [Type : -7,Precision : 1,Scale : 0] feature : [Type : -9,Precision : 50,Scale : 0] scenario : [Type : -9,Precision : 50,Scale : 0] clientregion : [Type : -9,Precision : 4,Scale : 0] eventcount : [Type : -5,Precision : 19,Scale : 0] customercountryiso2 : [Type : -9,Precision : 50,Scale : 0] completed_emailsubscriptions : [Type : -5,Precision : 19,Scale : 0] salesvaluegbp : [Type : 3,Precision : 38,Scale : 2] abtestid : [Type : -5,Precision : 19,Scale : 0] agentid : [Type : -5,Precision : 19,Scale : 0] started_emailsubscriptions : [Type : -5,Precision : 19,Scale : 0] completed_problemsolversurveys : [Type : -5,Precision : 19,Scale : 0] appversion : [Type : 12,Precision : 50,Scale : 0] customerdevice : [Type : -9,Precision : 50,Scale : 0] clientjourney : [Type : -9,Precision : 50,Scale : 0] eventdate : [Type : -9,Precision : 10,Scale : 0] eventreason : [Type : -9,Precision : 50,Scale : 0] abtestgroup : [Type : -9,Precision : 50,Scale : 0] clientcurrencyiso3 : [Type : -9,Precision : 50,Scale : 0] completed_customersurveys : [Type : -5,Precision : 19,Scale : 0] started_problemsolversurveys : [Type : -5,Precision : 19,Scale : 0] eventtype : [Type : -9,Precision : 50,Scale : 0] 7670 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Database column name - info map : started_customersurveys : [Type : -5,Precision : 19,Scale : 0] pdate : [Type : 12,Precision : 50,Scale : 0] uniqueeventcount : [Type : -5,Precision : 19,Scale : 0] sales : [Type : -5,Precision : 19,Scale : 0] customerbrowser : [Type : -9,Precision : 50,Scale : 0] salesvalue : [Type : 3,Precision : 38,Scale : 2] abtestiscontrol : [Type : -7,Precision : 1,Scale : 0] feature : [Type : -9,Precision : 50,Scale : 0] scenario : [Type : -9,Precision : 50,Scale : 0] clientregion : [Type : -9,Precision : 4,Scale : 0] eventcount : [Type : -5,Precision : 19,Scale : 0] customercountryiso2 : [Type : -9,Precision : 50,Scale : 0] completed_emailsubscriptions : [Type : -5,Precision : 19,Scale : 0] salesvaluegbp : [Type : 3,Precision : 38,Scale : 2] abtestid : [Type : 
-5,Precision : 19,Scale : 0] agentid : [Type : -5,Precision : 19,Scale : 0] started_emailsubscriptions : [Type : -5,Precision : 19,Scale : 0] completed_problemsolversurveys : [Type : -5,Precision : 19,Scale : 0] appversion : [Type : 12,Precision : 50,Scale : 0] customerdevice : [Type : -9,Precision : 50,Scale : 0] clientjourney : [Type : -9,Precision : 50,Scale : 0] eventdate : [Type : -9,Precision : 10,Scale : 0] eventreason : [Type : -9,Precision : 50,Scale : 0] abtestgroup : [Type : -9,Precision : 50,Scale : 0] clientcurrencyiso3 : [Type : -9,Precision : 50,Scale : 0] completed_customersurveys : [Type : -5,Precision : 19,Scale : 0] started_problemsolversurveys : [Type : -5,Precision : 19,Scale : 0] eventtype : [Type : -9,Precision : 50,Scale : 0] 7834 [main] INFO org.apache.hive.hcatalog.common.HiveClientCache - Initializing cache: eviction-timeout=120 initial-capacity=50 maximum-capacity=50 7872 [main] INFO hive.metastore - Trying to connect to metastore with URI thrift://d-u2-prcs-sv-01.veproduction.dom:9083 7917 [main] INFO hive.metastore - Connected to metastore. 10113 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - HCatalog full table schema fields = [eventdate, clientregion, clientjourney, eventtype, eventreason, feature, customerdevice, customerbrowser, customercountryiso2, clientcurrencyiso3, eventcount, uniqueeventcount, sales, salesvalue, salesvaluegbp, started_customersurveys, completed_customersurveys, started_emailsubscriptions, completed_emailsubscriptions, started_problemsolversurveys, completed_problemsolversurveys, scenario, abtestgroup, abtestid, abtestiscontrol, appversion, agentid, pdate] 10113 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - HCatalog full table schema fields = [eventdate, clientregion, clientjourney, eventtype, eventreason, feature, customerdevice, customerbrowser, customercountryiso2, clientcurrencyiso3, eventcount, uniqueeventcount, sales, salesvalue, salesvaluegbp, started_customersurveys, completed_customersurveys, started_emailsubscriptions, completed_emailsubscriptions, started_problemsolversurveys, completed_problemsolversurveys, scenario, abtestgroup, abtestid, abtestiscontrol, appversion, agentid, pdate] 10849 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - HCatalog table partitioning key fields = [pdate] 10849 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - HCatalog table partitioning key fields = [pdate] 10849 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - HCatalog projected schema fields = [eventdate, clientregion, clientjourney, eventtype, eventreason, feature, customerdevice, customerbrowser, customercountryiso2, clientcurrencyiso3, eventcount, uniqueeventcount, sales, salesvalue, salesvaluegbp, started_customersurveys, completed_customersurveys, started_emailsubscriptions, completed_emailsubscriptions, started_problemsolversurveys, completed_problemsolversurveys, scenario, abtestgroup, abtestid, abtestiscontrol, appversion, agentid, pdate] 10849 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - HCatalog projected schema fields = [eventdate, clientregion, clientjourney, eventtype, eventreason, feature, customerdevice, customerbrowser, customercountryiso2, clientcurrencyiso3, eventcount, uniqueeventcount, sales, salesvalue, salesvaluegbp, started_customersurveys, completed_customersurveys, started_emailsubscriptions, completed_emailsubscriptions, started_problemsolversurveys, completed_problemsolversurveys, scenario, abtestgroup, 
abtestid, abtestiscontrol, appversion, agentid, pdate] 10889 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - HCatalog job : Hive Home = /usr/lib/hive 10889 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - HCatalog job : Hive Home = /usr/lib/hive 10889 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - HCatalog job: HCatalog Home = /usr/lib/hcatalog 10889 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - HCatalog job: HCatalog Home = /usr/lib/hcatalog 10920 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Adding jar files under /usr/lib/hcatalog/share/hcatalog to distributed cache 10920 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Adding jar files under /usr/lib/hcatalog/share/hcatalog to distributed cache 10920 [main] WARN org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - No files under /usr/lib/hcatalog/share/hcatalog to add to distributed cache for hcatalog job 10920 [main] WARN org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - No files under /usr/lib/hcatalog/share/hcatalog to add to distributed cache for hcatalog job 10920 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Adding jar files under /usr/lib/hcatalog/lib to distributed cache 10920 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Adding jar files under /usr/lib/hcatalog/lib to distributed cache 10920 [main] WARN org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - No files under /usr/lib/hcatalog/lib to add to distributed cache for hcatalog job 10920 [main] WARN org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - No files under /usr/lib/hcatalog/lib to add to distributed cache for hcatalog job 10920 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Adding jar files under /usr/lib/hive/lib to distributed cache 10920 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Adding jar files under /usr/lib/hive/lib to distributed cache 10920 [main] WARN org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - No files under /usr/lib/hive/lib to add to distributed cache for hcatalog job 10920 [main] WARN org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - No files under /usr/lib/hive/lib to add to distributed cache for hcatalog job 10920 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Adding jar files under /usr/lib/hcatalog/share/hcatalog/storage-handlers to distributed cache (recursively) 10920 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Adding jar files under /usr/lib/hcatalog/share/hcatalog/storage-handlers to distributed cache (recursively) 10920 [main] WARN org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - No files under /usr/lib/hcatalog/share/hcatalog/storage-handlers to add to distributed cache for hcatalog job 10920 [main] WARN org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - No files under /usr/lib/hcatalog/share/hcatalog/storage-handlers to add to distributed cache for hcatalog job 10921 [main] DEBUG org.apache.sqoop.mapreduce.JobBase - Using InputFormat: class org.apache.sqoop.mapreduce.hcat.SqoopHCatExportFormat 10921 [main] DEBUG org.apache.sqoop.mapreduce.JobBase - Using InputFormat: class org.apache.sqoop.mapreduce.hcat.SqoopHCatExportFormat 10921 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Configuring HCatalog for export job 10921 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Configuring HCatalog for export job 10921 [main] INFO 
org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Ignoring configuration request for HCatalog info 10921 [main] INFO org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities - Ignoring configuration request for HCatalog info 11112 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 11112 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 11112 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 11113 [main] DEBUG org.apache.sqoop.mapreduce.JobBase - Adding to job classpath: file:/mnt/resource/hadoop/yarn/local/filecache/14417/sqoop-1.4.6.2.6.2.0-205.jar 11113 [main] DEBUG org.apache.sqoop.mapreduce.JobBase - Adding to job classpath: file:/mnt/resource/hadoop/yarn/local/filecache/14417/sqoop-1.4.6.2.6.2.0-205.jar 11114 [main] DEBUG org.apache.sqoop.mapreduce.JobBase - Adding to job classpath: file:/mnt/resource/hadoop/yarn/local/usercache/louiscronin/filecache/1096/mssql-jdbc-8.2.2.jre8.jar 11114 [main] DEBUG org.apache.sqoop.mapreduce.JobBase - Adding to job classpath: file:/mnt/resource/hadoop/yarn/local/usercache/louiscronin/filecache/1096/mssql-jdbc-8.2.2.jre8.jar 11115 [main] DEBUG org.apache.sqoop.mapreduce.JobBase - Adding to job classpath: file:/mnt/resource/hadoop/yarn/local/filecache/14417/sqoop-1.4.6.2.6.2.0-205.jar 11115 [main] DEBUG org.apache.sqoop.mapreduce.JobBase - Adding to job classpath: file:/mnt/resource/hadoop/yarn/local/filecache/14417/sqoop-1.4.6.2.6.2.0-205.jar 11116 [main] DEBUG org.apache.sqoop.mapreduce.JobBase - Adding to job classpath: file:/mnt/resource/hadoop/yarn/local/filecache/14417/sqoop-1.4.6.2.6.2.0-205.jar 11116 [main] DEBUG org.apache.sqoop.mapreduce.JobBase - Adding to job classpath: file:/mnt/resource/hadoop/yarn/local/filecache/14417/sqoop-1.4.6.2.6.2.0-205.jar 11116 [main] WARN org.apache.sqoop.mapreduce.JobBase - SQOOP_HOME is unset. May not be able to find all job dependencies. 11116 [main] WARN org.apache.sqoop.mapreduce.JobBase - SQOOP_HOME is unset. May not be able to find all job dependencies. 11210 [main] INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at d-u2-prcs-nm-03.veproduction.dom/172.28.50.22:10200 11330 [main] INFO org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider - Looking for the active RM in [rm1, rm2]... 11336 [main] INFO org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider - Found active RM [rm2] 11518 [main] INFO org.apache.hadoop.mapreduce.JobSubmitter - Cleaning up the staging area /user/louiscronin/.staging/job_1612961662367_2162 11546 [main] WARN org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor - Disabling threads for Delete operation as thread count 0 is <= 1 11554 [main] INFO org.apache.hadoop.fs.azure.AzureFileSystemThreadPoolExecutor - Time taken for Delete operation is: 9 ms with threads: 0 11603 [main] ERROR org.apache.sqoop.Sqoop - Got exception running Sqoop: java.lang.IllegalArgumentException: Can not create a Path from an empty string 11603 [main] ERROR org.apache.sqoop.Sqoop - Got exception running Sqoop: java.lang.IllegalArgumentException: Can not create a Path from an empty string
Apache Pig, Elephant Bird JSON Loader
I'm trying to parse below input (there are 2 records in this input)using Elephantbird json loader [{"node_disk_lnum_1":36,"node_disk_xfers_in_rate_sum":136.40000000000001,"node_disk_bytes_in_rate_22": 187392.0, "node_disk_lnum_7": 13}] [{"node_disk_lnum_1": 36, "node_disk_xfers_in_rate_sum": 105.2,"node_disk_bytes_in_rate_22": 123084.8, "node_disk_lnum_7":13}] Here is my syntax: register '/home/data/Desktop/elephant-bird-pig-4.1.jar'; a = LOAD '/pig/tc1.log' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]); b = FOREACH a GENERATE flatten(json#'node_disk_lnum_1') AS node_disk_lnum_1,flatten(json#'node_disk_xfers_in_rate_sum') AS node_disk_xfers_in_rate_sum,flatten(json#'node_disk_bytes_in_rate_22') AS node_disk_bytes_in_rate_22, flatten(json#'node_disk_lnum_7') AS node_disk_lnum_7; DESCRIBE b; b describe result: b: {node_disk_lnum_1: bytearray,node_disk_xfers_in_rate_sum: bytearray,node_disk_bytes_in_rate_22: bytearray,node_disk_lnum_7: bytearray} c = FOREACH b GENERATE node_disk_lnum_1; DESCRIBE c; c: {node_disk_lnum_1: bytearray} DUMP c; Expected Result: 36, 136.40000000000001, 187392.0, 13 36, 105.2, 123084.8, 13 Throwing the below error 2017-02-06 01:05:49,337 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN 2017-02-06 01:05:49,386 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code. 2017-02-06 01:05:49,387 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]} 2017-02-06 01:05:49,390 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Map key required for a: $0->[node_disk_lnum_1, node_disk_xfers_in_rate_sum, node_disk_bytes_in_rate_22, node_disk_lnum_7] 2017-02-06 01:05:49,395 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false 2017-02-06 01:05:49,398 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1 2017-02-06 01:05:49,398 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1 2017-02-06 01:05:49,425 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job 2017-02-06 01:05:49,426 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 2017-02-06 01:05:49,428 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. com/twitter/elephantbird/util/HadoopCompat Please help what am I missing?
You do not have any nested data in your JSON, so remove the -nestedLoad option:
a = LOAD '/pig/tc1.log' USING com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]);
French coreference annotation using CoreNLP
can someone help me to correct my setting for performing coreference annotation for French by using coreNLP? I have tryed the basic suggestion by editing the properties file: annotators = tokenize, ssplit, pos, parse, lemma, ner, parse, depparse, mention, coref tokenize.language = fr pos.model = edu/stanford/nlp/models/pos-tagger/french/french.tagger parse.model = edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz The command: java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props frenchProps.properties -file frenchFile.txt which gets the following output log: [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/french/french.tagger ... done [0.3 sec]. [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse [main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz ... done [2.2 sec]. [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [2.0 sec]. Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.7 sec]. Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.9 sec]. [main] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1. Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt ago 23, 2016 5:37:34 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules INFORMACIÓN: Read 83 rules Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt ago 23, 2016 5:37:34 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules INFORMACIÓN: Read 267 rules Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt ago 23, 2016 5:37:34 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules INFORMACIÓN: Read 25 rules [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ... PreComputed 100000, Elapsed Time: 1.639 (s) Initializing dependency parser done [6.4 sec]. 
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator mention Using mention detector type: rule [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator coref Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOfRange(Arrays.java:3664) at java.lang.String.<init>(String.java:207) at java.lang.StringBuilder.toString(StringBuilder.java:407) at java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:3097) at java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:2892) at java.io.ObjectInputStream.readString(ObjectInputStream.java:1646) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373) at java.util.HashMap.readObject(HashMap.java:1402) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1909) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373) at edu.stanford.nlp.io.IOUtils.readObjectFromURLOrClasspathOrFileSystem(IOUtils.java:324) at edu.stanford.nlp.scoref.SimpleLinearClassifier.<init>(SimpleLinearClassifier.java:30) at edu.stanford.nlp.scoref.PairwiseModel.<init>(PairwiseModel.java:75) at edu.stanford.nlp.scoref.PairwiseModel$Builder.build(PairwiseModel.java:57) at edu.stanford.nlp.scoref.ClusteringCorefSystem.<init>(ClusteringCorefSystem.java:31) at edu.stanford.nlp.scoref.StatisticalCorefSystem.fromProps(StatisticalCorefSystem.java:48) at edu.stanford.nlp.pipeline.CorefAnnotator.<init>(CorefAnnotator.java:66) at edu.stanford.nlp.pipeline.AnnotatorImplementations.coref(AnnotatorImplementations.java:220) at edu.stanford.nlp.pipeline.AnnotatorFactories$13.create(AnnotatorFactories.java:515) at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:85) at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:375) Which made me to think there are extra missing configuration stuff.
AFAIK CoreNLP doesn't offer coreference resolution for French. (see also http://stanfordnlp.github.io/CoreNLP/coref.html)
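If the goal is just to run the rest of the French pipeline without coreference, a properties file along these lines should avoid loading the coref (and English NER) models at all. This is only a sketch built from the properties already shown in the question; it drops the mention/coref annotators, the ner annotator, and the duplicated parse entry:
annotators = tokenize, ssplit, pos, lemma, parse, depparse
tokenize.language = fr
pos.model = edu/stanford/nlp/models/pos-tagger/french/french.tagger
parse.model = edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz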
Stanford CoreNLP dedicated server ignoring annotators input
I'm running the CoreNLP dedicated server on AWS and trying to make a request from ruby. The server seems to be receiving the request correctly but the issue is the server seems to ignore the input annotators list and always default to all annotators. My Ruby code to make the request looks like so: uri = URI.parse(URI.encode('http://ec2-************.compute.amazonaws.com//?properties={"tokenize.whitespace": "true", "annotators": "tokenize,ssplit,pos", "outputFormat": "json"}')) http = Net::HTTP.new(uri.host, uri.port) request = Net::HTTP::Post.new("/v1.1/auth") request.add_field('Content-Type', 'application/json') request.body = text response = http.request(request) json = JSON.parse(response.body) In the nohup.out logs on the server I see the following: [/38.122.182.107:53507] API call w/annotators tokenize,ssplit,pos,depparse,lemma,ner,mention,coref,natlog,openie .... INPUT TEXT BLOCK HERE .... [pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize [pool-1-thread-1] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer. [pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit [pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [2.0 sec]. [pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ... PreComputed 100000, Elapsed Time: 2.259 (s) Initializing dependency parser done [5.1 sec]. [pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma [pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [2.6 sec]. Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [1.2 sec]. Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [7.2 sec]. [pool-1-thread-1] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1. Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt Feb 22, 2016 11:37:20 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules INFO: Read 83 rules Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt Feb 22, 2016 11:37:20 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules INFO: Read 267 rules Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt Feb 22, 2016 11:37:20 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules INFO: Read 25 rules [pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator mention Using mention detector type: dependency [pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator coref etc etc. When I run test queries using wget on the command line it seems to work fine. 
wget --post-data 'the quick brown fox jumped over the lazy dog' 'ec2-*******.compute.amazonaws.com/?properties={"tokenize.whitespace": "true", "annotators": "tokenize,ssplit,pos", "outputFormat": "json"}' -O -
Any help as to why this is happening would be appreciated, thanks!
It turns out the request was being constructed incorrectly. The path should be passed in the argument to Post.new. Corrected code below in case it helps anyone:
host = "http://ec2-***********.us-west-2.compute.amazonaws.com"
path = '/?properties={"tokenize.whitespace": "true", "annotators": "tokenize,ssplit,pos", "outputFormat": "json"}'
encoded_path = URI.encode(path)
uri = URI.parse(URI.encode(host))
http = Net::HTTP.new(uri.host, uri.port)
http.set_debug_output($stdout)
# request = Net::HTTP::Post.new("/v1.1/auth")
request = Net::HTTP::Post.new(encoded_path)
request.add_field('Content-Type', 'application/json')
request.body = text
response = http.request(request)
json = JSON.parse(response.body)