Solr is printing a lot of logs with empty params - performance

Solr is printing a lot of lines like the ones below in solr_slow_requests.log:
solr_slow_requests.log.1:2023-02-03 17:30:27.084 WARN (qtp1961945640-50747) [c:products s:shard4 r:core_node47 x:products_shard4_replica_p44] o.a.s.c.S.SlowRequest slow: [products_shard4_replica_p44] webapp=/solr path=/select params={} rid=10.0.61.80-5704218 hits=9309 status=0 QTime=1280
solr_slow_requests.log.1:2023-02-03 17:30:27.157 WARN (qtp1961945640-50744) [c:products s:shard4 r:core_node47 x:products_shard4_replica_p44] o.a.s.c.S.SlowRequest slow: [products_shard4_replica_p44] webapp=/solr path=/select params={} rid=10.0.61.80-5704223 hits=9730 status=0 QTime=1508
solr_slow_requests.log.1:2023-02-03 17:30:27.325 WARN (qtp1961945640-50742) [c:products s:shard5 r:core_node59 x:products_shard5_replica_p56] o.a.s.c.S.SlowRequest slow: [products_shard5_replica_p56] webapp=/solr path=/select params={} rid=10.0.61.80-5704234 hits=9309 status=0 QTime=1993
solr_slow_requests.log.1:2023-02-03 17:30:27.326 WARN (qtp1961945640-50746) [c:products s:shard4 r:core_node47 x:products_shard4_replica_p44] o.a.s.c.S.SlowRequest slow: [products_shard4_replica_p44] webapp=/solr path=/select params={} rid=10.0.61.80-5704235 hits=9309 status=0 QTime=1994
solr_slow_requests.log.1:2023-02-03 17:30:27.657 WARN (qtp1961945640-50668) [c:products s:shard2 r:core_node23 x:products_shard2_replica_p20] o.a.s.c.S.SlowRequest slow: [products_shard2_replica_p20] webapp=/solr path=/select params={} rid=10.0.61.80-5704247 hits=9730 status=0 QTime=1140
solr_slow_requests.log.1:2023-02-03 17:30:27.700 WARN (qtp1961945640-50757) [c:products s:shard3 r:core_node35 x:products_shard3_replica_p32] o.a.s.c.S.SlowRequest slow: [products_shard3_replica_p32] webapp=/solr path=/select params={} rid=10.0.61.80-5704249 hits=9730 status=0 QTime=1068
solr_slow_requests.log.1:2023-02-03 17:30:27.720 WARN (qtp1961945640-50661) [c:products s:shard3 r:core_node35 x:products_shard3_replica_p32] o.a.s.c.S.SlowRequest slow: [products_shard3_replica_p32] webapp=/solr path=/select params={} rid=10.0.61.80-5704254 hits=9309 status=0 QTime=1023
solr_slow_requests.log.1:2023-02-03 17:30:27.816 WARN (qtp1961945640-49782) [c:products s:shard6 r:core_node71 x:products_shard6_replica_p68] o.a.s.c.S.SlowRequest slow: [products_shard6_replica_p68] webapp=/solr path=/select params={} rid=10.0.61.80-5704262 hits=9730 status=0 QTime=1246
solr_slow_requests.log.1:2023-02-03 17:30:27.825 WARN (qtp1961945640-50750) [c:products s:shard5 r:core_node59 x:products_shard5_replica_p56] o.a.s.c.S.SlowRequest slow: [products_shard5_replica_p56] webapp=/solr path=/select params={} rid=10.0.61.80-5704263 hits=9730 status=0 QTime=1847
solr_slow_requests.log.1:2023-02-03 17:30:27.888 WARN (qtp1961945640-50711) [c:products s:shard3 r:core_node35 x:products_shard3_replica_p32] o.a.s.c.S.SlowRequest slow: [products_shard3_replica_p32] webapp=/solr path=/select params={} rid=10.0.61.80-5704266 hits=9309 status=0 QTime=1150
solr_slow_requests.log.1:2023-02-03 17:30:27.995 WARN (qtp1961945640-50734) [c:products s:shard5 r:core_node59 x:products_shard5_replica_p56] o.a.s.c.S.SlowRequest slow: [products_shard5_replica_p56] webapp=/solr path=/select params={} rid=10.0.61.80-5704277 hits=9730 status=0 QTime=1481
I have two questions:
Why doesn't Solr print the params? The params field is always empty (params={}) even though our requests send many parameters.
Why are these requests taking so much time when there is no CPU spike?
We have a 6-node cluster with 6 shards of the products collection, running on 16-core machines with 32 GB RAM each.
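For context, Solr writes these slow-request entries when a request's QTime exceeds the slow-query threshold configured in solrconfig.xml. A minimal sketch, assuming a 1000 ms threshold (the value shown is illustrative, not necessarily our actual config):

<query>
  <!-- requests slower than this many milliseconds go to the slow request log -->
  <slowQueryThresholdMillis>1000</slowQueryThresholdMillis>
</query>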
A sample of our query looks like this:
solr.log.1:2023-02-06 12:02:54.090 INFO (qtp1961945640-4677) [c:products s:shard1 r:core_node11 x:products_shard1_replica_p8] o.a.s.c.S.Request [products_shard1_replica_p8] webapp=/solr path=/select params={df=_text_&distrib=false&fl=id&fl=score&shards.purpose=16388&start=0&fsv=true&fq=channel_identifier:632aff00940b4e27c80986f3&fq=zone_identifier:"_all_"&fq=is_available:True&fq=image_nature:("standard"+OR+"substandard"+OR+"default")&fq=product_online_date:[*+TO+NOW]&fq={!tag%3Dbrand_id}brand_id:("74"+OR+"235")&sort=popularity+desc+,id+asc&shard.url=http://IP:8983/solr/products_shard1_replica_p8/|http://IP:8983/solr/products_shard1_replica_n2/|http://IP:8983/solr/products_shard1_replica_p6/|http://IP:8983/solr/products_shard1_replica_t4/|http://IP:8983/solr/products_shard1_replica_p10/|http://IP:8983/solr/products_shard1_replica_n1/&rows=11550&rid=IP-225159&version=2&q=*:*&omitHeader=false&NOW=1675684973896&json={"query":+"*:*",+"params":+{"df":+"_text_",+"_route_":+"632aff00940b4e27c80986f3/2!",+"start":+11500,+"rows":+50},+"fields":+["*+score"],+"filter":+["channel_identifier:632aff00940b4e27c80986f3",+"zone_identifier:\"_all_\"",+"is_available:True",+"image_nature:(\"standard\"+OR+\"substandard\"+OR+\"default\")",+"product_online_date:[*+TO+NOW]",+{"#brand_id":+"brand_id:(\"74\"+OR+\"235\")"}],+"sort":+"popularity+desc+,id+asc"}&isShard=true&wt=javabin&_route_=632aff00940b4e27c80986f3/2!} hits=42585 status=0 QTime=193
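One detail worth noting in this sample: the shard-level request carries rows=11550 because the client asked for start=11500 and rows=50, and with offset-based deep paging every shard must collect and return start+rows candidates per page. A hedged sketch of cursor-based paging, which avoids that cost (cursorMark is standard Solr; this assumes id is the collection's uniqueKey, which the existing sort=popularity desc,id asc tiebreaker already satisfies):

q=*:*&sort=popularity desc,id asc&rows=50&cursorMark=*
# each response carries a nextCursorMark value; pass it back as cursorMark to fetch the next page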
This happens only during a load test. Some of Solr's JVM startup parameters are listed below:
-Djetty.home=/opt/solr/server
-Djetty.port=8983
-Dlog4j2.formatMsgNoLookups=true
-Dnewrelic.environment=[]
-Dsolr.data.home=
-Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf
-Dsolr.documentCache.initialSize=8339
-Dsolr.documentCache.size=8339
-Dsolr.filterCache.initialSize=8339
-Dsolr.filterCache.size=8339
-Dsolr.install.dir=/opt/solr
-Dsolr.jetty.inetaccess.excludes=
-Dsolr.jetty.inetaccess.includes=
-Dsolr.log.dir=/var/solr/data/logs
-Dsolr.log.muteconsole
-Dsolr.queryResultCache.initialSize=6671
-Dsolr.queryResultCache.size=6671
-Dsolr.solr.home=/var/solr/data/data
-Duser.timezone=UTC
-DzkClientTimeout=30000
-XX:+AggressiveOpts
-XX:+AlwaysPreTouch
-XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+UseG1GC
-XX:+UseGCLogFileRotation
-XX:+UseLargePages
-XX:-OmitStackTraceInFastThrow
-XX:ConcGCThreads=4
-XX:G1ReservePercent=10
-XX:GCLogFileSize=20M
-XX:InitiatingHeapOccupancyPercent=80
-XX:MaxGCPauseMillis=100
-XX:MaxTenuringThreshold=8
-XX:NewRatio=3
-XX:NumberOfGCLogFiles=9
-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/data/logs
-XX:ParallelGCThreads=4
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=4
-Xloggc:/var/solr/data/logs/solr_gc.log
-Xms20000m
-Xmx20000m
-Xss256k
-javaagent:/opt/solr/contrib/newrelic/newrelic.jar
-verbose:gc

Related

Logstash: NameError: undefined local variable or method `dotfile' for #<AwesomePrint::Inspector:0x77011d93>

I'm migrating a Logstash setup to an EC2 instance running Amazon Linux.
Via the command tail -f /var/log/logstash/logstash-plain.log
I'm getting the following log, cycling/repeating:
[2017-12-20T15:30:24,742][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"netflow", :directory=>"/usr/share/logstash/modules/netflow/configuration"}
[2017-12-20T15:30:24,745][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"fb_apache", :directory=>"/usr/share/logstash/modules/fb_apache/configuration"}
[2017-12-20T15:30:27,342][INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[https://search-ivendas-sz2q3f573vro6xlncwjnvzbf2m.us-east-1.es.amazonaws.com:443/]}}
[2017-12-20T15:30:27,343][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>https://search-ivendas-sz2q3f573vro6xlncwjnvzbf2m.us-east-1.es.amazonaws.com:443/, :path=>"/"}
[2017-12-20T15:30:28,040][WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"https://search-ivendas-sz2q3f573vro6xlncwjnvzbf2m.us-east-1.es.amazonaws.com:443/"}
[2017-12-20T15:30:28,175][INFO ][logstash.outputs.elasticsearch] Using mapping template from {:path=>nil}
[2017-12-20T15:30:28,185][INFO ][logstash.outputs.elasticsearch] Attempting to install template {:manage_template=>{"template"=>"logstash-*", "version"=>50001, "settings"=>{"index.refresh_interval"=>"5s"}, "mappings"=>{"_default_"=>{"_all"=>{"enabled"=>true, "norms"=>false}, "dynamic_templates"=>[{"message_field"=>{"path_match"=>"message", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false}}}, {"string_fields"=>{"match"=>"*", "match_mapping_type"=>"string", "mapping"=>{"type"=>"text", "norms"=>false, "fields"=>{"keyword"=>{"type"=>"keyword", "ignore_above"=>256}}}}}], "properties"=>{"#timestamp"=>{"type"=>"date", "include_in_all"=>false}, "#version"=>{"type"=>"keyword", "include_in_all"=>false}, "geoip"=>{"dynamic"=>true, "properties"=>{"ip"=>{"type"=>"ip"}, "location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"}, "longitude"=>{"type"=>"half_float"}}}}}}}}
[2017-12-20T15:30:28,201][INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["//search-ivendas-sz2q3f573vro6xlncwjnvzbf2m.us-east-1.es.amazonaws.com:443"]}
[2017-12-20T15:30:28,385][INFO ][logstash.pipeline ] Starting pipeline {"id"=>"main", "pipeline.workers"=>2, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>5, "pipeline.max_inflight"=>250}
[2017-12-20T15:30:29,298][INFO ][logstash.pipeline ] Pipeline main started
[2017-12-20T15:30:29,502][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}
[2017-12-20T15:30:29,979][FATAL][logstash.runner ] An unexpected error occurred! {:error=>#<NameError: undefined local variable or method `dotfile' for #<AwesomePrint::Inspector:0x18bafa48>>, :backtrace=>["/usr/share/logstash/vendor/bundle/jruby/1.9/gems/awesome_print-1.8.0/lib/awesome_print/inspector.rb:163:in `merge_custom_defaults!'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/awesome_print-1.8.0/lib/awesome_print/inspector.rb:50:in `initialize'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/awesome_print-1.8.0/lib/awesome_print/core_ext/kernel.rb:9:in `ai'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-codec-rubydebug-3.0.5/lib/logstash/codecs/rubydebug.rb:39:in `encode_default'", "org/jruby/RubyMethod.java:120:in `call'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-codec-rubydebug-3.0.5/lib/logstash/codecs/rubydebug.rb:35:in `encode'", "/usr/share/logstash/logstash-core/lib/logstash/codecs/base.rb:50:in `multi_encode'", "org/jruby/RubyArray.java:1613:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/codecs/base.rb:50:in `multi_encode'", "/usr/share/logstash/logstash-core/lib/logstash/outputs/base.rb:90:in `multi_receive'", "/usr/share/logstash/logstash-core/lib/logstash/output_delegator_strategies/single.rb:15:in `multi_receive'", "org/jruby/ext/thread/Mutex.java:149:in `synchronize'", "/usr/share/logstash/logstash-core/lib/logstash/output_delegator_strategies/single.rb:14:in `multi_receive'", "/usr/share/logstash/logstash-core/lib/logstash/output_delegator.rb:49:in `multi_receive'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:434:in `output_batch'", "org/jruby/RubyHash.java:1342:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:433:in `output_batch'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:381:in `worker_loop'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:342:in `start_workers'"]}
I did install the missing plugins; before that, I was getting other errors.
Is there some way to get more details about the problem?
What am I missing?
This is an issue with the awesome_print gem used by the rubydebug codec. Set the HOME environment variable (export HOME=<path_to_aprc_file>), which is used to load the .aprc configuration the plugin requires. Refer to this to persist the env variable; one possible setup is sketched below.
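A minimal sketch, assuming Logstash runs as a systemd service and that /usr/share/logstash is a suitable home directory for the .aprc file (both paths are assumptions):

# current shell only
export HOME=/usr/share/logstash

# persist it for the service with a systemd drop-in override
sudo systemctl edit logstash
#   then add in the editor:
#   [Service]
#   Environment="HOME=/usr/share/logstash"
sudo systemctl daemon-reload
sudo systemctl restart logstash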

StanfordCoreNLP - Setting pipelineLanguage to German not working?

I am using the pycorenlp client in order to talk to the Stanford CoreNLP Server. In my setup I am setting pipelineLanguage to german like this:
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')
text = 'Das große Auto.'
output = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit,pos,depparse,parse',
    'outputFormat': 'json',
    'pipelineLanguage': 'german'
})
However, from the looks of the output, I'd say that it's not working:
output['sentences'][0]['tokens']
will return:
[{'after': ' ',
  'before': '',
  'characterOffsetBegin': 0,
  'characterOffsetEnd': 3,
  'index': 1,
  'originalText': 'Das',
  'pos': 'NN',
  'word': 'Das'},
 {'after': ' ',
  'before': ' ',
  'characterOffsetBegin': 4,
  'characterOffsetEnd': 9,
  'index': 2,
  'originalText': 'große',
  'pos': 'NN',
  'word': 'große'},
 {'after': '',
  'before': ' ',
  'characterOffsetBegin': 10,
  'characterOffsetEnd': 14,
  'index': 3,
  'originalText': 'Auto',
  'pos': 'NN',
  'word': 'Auto'},
 {'after': '',
  'before': '',
  'characterOffsetBegin': 14,
  'characterOffsetEnd': 15,
  'index': 4,
  'originalText': '.',
  'pos': '.',
  'word': '.'}]
This should be more like
Das große Auto
POS: DT JJ NN
It seems to me that setting 'pipelineLanguage': 'german' does not work for some reason.
I've executed
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
in order to start the server.
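As an aside, the server can also be told to load a language's default properties at startup via the -serverProperties flag; a sketch, assuming the German models jar is already on the classpath:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-german.properties -port 9000 -timeout 15000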
I am getting the following from the logger:
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
[pool-1-thread-3] ERROR CoreNLP - Failure to load language specific properties: StanfordCoreNLP-german.properties for german
[pool-1-thread-3] INFO CoreNLP - [/127.0.0.1:60700] API call w/annotators tokenize,ssplit,pos,depparse,parse
Das große Auto.
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-3] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.5 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ...
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 99996, Elapsed Time: 8.645 (s)
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [9.8 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.3 sec].
Apparently the server is loading the models for the English language - without warning me about that.
Alright, I just downloaded the German models jar from the website and moved it into the directory where I extracted the server, e.g.
~/Downloads/stanford-corenlp-full-2017-06-09 $
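A sketch of those steps (the jar name follows the 2017-06-09 release naming; the exact download URL is an assumption):

wget http://nlp.stanford.edu/software/stanford-german-corenlp-2017-06-09-models.jar
mv stanford-german-corenlp-2017-06-09-models.jar ~/Downloads/stanford-corenlp-full-2017-06-09/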
After re-running the server, the models were loaded successfully:
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-1-thread-3] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger ... done [5.1 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model file: edu/stanford/nlp/models/parser/nndep/UD_German.gz ...
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 99984, Elapsed Time: 11.419 (s)
[pool-1-thread-3] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [12.2 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-1-thread-3] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/germanFactored.ser.gz ... done [1.0 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[pool-1-thread-3] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[pool-1-thread-3] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/german.conll.hgc_175m_600.crf.ser.gz ... done [0.7 sec].

French coreference annotation using CoreNLP

Can someone help me correct my settings for performing coreference annotation for French using CoreNLP? I have tried the basic suggestion of editing the properties file:
annotators = tokenize, ssplit, pos, parse, lemma, ner, parse, depparse, mention, coref
tokenize.language = fr
pos.model = edu/stanford/nlp/models/pos-tagger/french/french.tagger
parse.model = edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz
The command:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props frenchProps.properties -file frenchFile.txt
which gets the following output log:
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/french/french.tagger ... done [0.3 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz ...
done [2.2 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [2.0 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.7 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.9 sec].
[main] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
ago 23, 2016 5:37:34 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFORMACIÓN: Read 83 rules
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
ago 23, 2016 5:37:34 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFORMACIÓN: Read 267 rules
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
ago 23, 2016 5:37:34 PM edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor appendRules
INFORMACIÓN: Read 25 rules
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ...
PreComputed 100000, Elapsed Time: 1.639 (s)
Initializing dependency parser done [6.4 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator mention
Using mention detector type: rule
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator coref
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.StringBuilder.toString(StringBuilder.java:407)
at java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:3097)
at java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:2892)
at java.io.ObjectInputStream.readString(ObjectInputStream.java:1646)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at java.util.HashMap.readObject(HashMap.java:1402)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1909)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at edu.stanford.nlp.io.IOUtils.readObjectFromURLOrClasspathOrFileSystem(IOUtils.java:324)
at edu.stanford.nlp.scoref.SimpleLinearClassifier.<init>(SimpleLinearClassifier.java:30)
at edu.stanford.nlp.scoref.PairwiseModel.<init>(PairwiseModel.java:75)
at edu.stanford.nlp.scoref.PairwiseModel$Builder.build(PairwiseModel.java:57)
at edu.stanford.nlp.scoref.ClusteringCorefSystem.<init>(ClusteringCorefSystem.java:31)
at edu.stanford.nlp.scoref.StatisticalCorefSystem.fromProps(StatisticalCorefSystem.java:48)
at edu.stanford.nlp.pipeline.CorefAnnotator.<init>(CorefAnnotator.java:66)
at edu.stanford.nlp.pipeline.AnnotatorImplementations.coref(AnnotatorImplementations.java:220)
at edu.stanford.nlp.pipeline.AnnotatorFactories$13.create(AnnotatorFactories.java:515)
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:85)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:375)
This made me think that some extra configuration is missing.
AFAIK CoreNLP doesn't offer coreference resolution for French. (see also http://stanfordnlp.github.io/CoreNLP/coref.html)
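That said, the rest of the French pipeline should load once the unsupported mention and coref annotators (and the duplicated parse entry) are removed; an untested sketch of the trimmed properties file, keeping the models from the question:

annotators = tokenize, ssplit, pos, lemma, ner, parse, depparse
tokenize.language = fr
pos.model = edu/stanford/nlp/models/pos-tagger/french/french.tagger
parse.model = edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz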

Nutch Elasticsearch Integration

I'm following this tutorial for setting up Nutch along with Elasticsearch. Whenever I try to index data into ES, it returns an error. Here are the logs:
Command:
bin/nutch index elasticsearch -all
Logs when I set elastic.port to 9200 in conf/nutch-site.xml:
2016-05-05 13:22:49,903 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2016-05-05 13:22:49,904 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2016-05-05 13:22:49,904 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-05-05 13:22:49,904 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-05-05 13:22:49,905 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2016-05-05 13:22:49,906 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2016-05-05 13:22:49,961 INFO elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]
2016-05-05 13:22:49,961 INFO elastic.ElasticIndexWriter - Processing to finalize last execute
2016-05-05 13:22:54,898 INFO client.transport - [Peggy Carter] failed to get node info for [#transport#-1][ubuntu][inet[localhost/127.0.0.1:9200]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9200]][cluster:monitor/nodes/info] request_id [1] timed out after [5000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-05-05 13:22:55,682 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2016-05-05 13:22:55,683 INFO indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port (default 9300)
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
2016-05-05 13:22:55,711 INFO elasticsearch.plugins - [Adrian Toomes] loaded [], sites []
2016-05-05 13:23:00,763 INFO client.transport - [Adrian Toomes] failed to get node info for [#transport#-1][ubuntu][inet[localhost/127.0.0.1:9200]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9200]][cluster:monitor/nodes/info] request_id [0] timed out after [5000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-05-05 13:23:00,766 INFO indexer.IndexingJob - IndexingJob: done.
Logs when the default port 9300 is used:
2016-05-05 13:58:44,584 INFO elasticsearch.plugins - [Mentallo] loaded [], sites []
2016-05-05 13:58:44,673 WARN transport.netty - [Mentallo] Message not fully read (response) for [0] handler future(org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler$1#3c80f1dd), error [true], resetting
2016-05-05 13:58:44,674 INFO client.transport - [Mentallo] failed to get node info for [#transport#-1][ubuntu][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.RemoteTransportException: Failed to deserialize exception response from stream
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize exception response from stream
at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:173)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:125)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.StreamCorruptedException: Unsupported version: 1
at org.elasticsearch.common.io.ThrowableObjectInputStream.readStreamHeader(ThrowableObjectInputStream.java:46)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
at org.elasticsearch.common.io.ThrowableObjectInputStream.<init>(ThrowableObjectInputStream.java:38)
at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:170)
... 23 more
2016-05-05 13:58:44,676 INFO indexer.IndexingJob - IndexingJob: done.
I've configured everything correctly as far as I can tell, and I've looked at various threads as well, to no avail. The Java version is the same for both ES and Nutch. Is there a bug here?
I'm using Nutch 2.3.1 and have tried both ES 1.4.4 and 2.3.2. I can see data in Mongo, but I cannot index data into ES. Why?
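For what it's worth, the Caused by: java.io.StreamCorruptedException: Unsupported version: 1 seen on port 9300 is the classic signature of a transport-protocol mismatch between the ES client library and the ES server. A hedged way to compare the two versions (the grep assumes a Nutch source checkout; paths vary by version):

curl -s http://localhost:9200/
# reports the running server's version.number

grep -r --include=ivy.xml "elasticsearch" .
# shows the ES client version the Nutch indexer plugin was built against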

Getting error while running query on hive over tez

I am getting an error while running a query on Hive over Tez. Per the logs, Hive fails while copying the Tez jars to an HDFS location when starting the Tez session (see the note after the log). Below is the complete log obtained from the Hive log file:
2015-06-19 01:23:52,289 INFO [HiveServer2-Background-Pool: Thread-41]: ql.Driver (SessionState.java:printInfo(852)) - Query ID = saurabh_20150619012323_f52f1d6c-2adb-4edc-8ba4-b64d7d898325
2015-06-19 01:23:52,289 INFO [HiveServer2-Background-Pool: Thread-41]: ql.Driver (SessionState.java:printInfo(852)) - Total jobs = 1
2015-06-19 01:23:52,289 INFO [HiveServer2-Background-Pool: Thread-41]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - </PERFLOG method=TimeToSubmit start=1434657232288 end=1434657232289 duration=1 from=org.apache.hadoop.hive.ql.Driver>
2015-06-19 01:23:52,290 INFO [HiveServer2-Background-Pool: Thread-41]: log.PerfLogger (PerfLogger.java:PerfLogBegin(121)) - <PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver>
2015-06-19 01:23:52,290 INFO [HiveServer2-Background-Pool: Thread-41]: log.PerfLogger (PerfLogger.java:PerfLogBegin(121)) - <PERFLOG method=task.TEZ.Stage-1 from=org.apache.hadoop.hive.ql.Driver>
2015-06-19 01:23:52,302 INFO [HiveServer2-Background-Pool: Thread-41]: ql.Driver (SessionState.java:printInfo(852)) - Launching Job 1 out of 1
2015-06-19 01:23:52,302 INFO [HiveServer2-Background-Pool: Thread-41]: ql.Driver (Driver.java:launchTask(1630)) - Starting task [Stage-1:MAPRED] in parallel
2015-06-19 01:23:52,312 INFO [Thread-21]: session.SessionState (SessionState.java:start(488)) - No Tez session required at this point. hive.execution.engine=mr.
2015-06-19 01:23:52,314 INFO [Thread-21]: tez.TezSessionPoolManager (TezSessionPoolManager.java:getSession(125)) - QueueName: null nonDefaultUser: true defaultQueuePool: null blockingQueueLength: -1
2015-06-19 01:23:52,315 INFO [Thread-21]: tez.TezSessionPoolManager (TezSessionPoolManager.java:getNewSessionState(154)) - Created a new session for queue: null session id: 85d83746-a48e-419e-a7ca-8c98faf173ea
2015-06-19 01:23:52,380 INFO [Thread-21]: Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1049)) - mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
2015-06-19 01:23:52,412 INFO [Thread-21]: ql.Context (Context.java:getMRScratchDir(328)) - New scratch dir is hdfs://localhost:9000/tmp/hive/saurabh/e5a701ae-242d-488f-beec-cf18878becdc/hive_2015-06-19_01-23-49_794_2167174123575230985-2
2015-06-19 01:23:52,420 INFO [Thread-21]: exec.Task (TezTask.java:updateSession(233)) - Tez session hasn't been created yet. Opening session
2015-06-19 01:23:52,420 INFO [Thread-21]: tez.TezSessionState (TezSessionState.java:open(142)) - User of session id 85d83746-a48e-419e-a7ca-8c98faf173ea is saurabh
2015-06-19 01:23:52,433 INFO [Thread-21]: tez.DagUtils (DagUtils.java:localizeResource(950)) - Localizing resource because it does not exist: file:/usr/lib/tez/* to dest: hdfs://localhost:9000/tmp/hive/saurabh/_tez_session_dir/85d83746-a48e-419e-a7ca-8c98faf173ea/*
2015-06-19 01:23:52,433 INFO [Thread-21]: tez.DagUtils (DagUtils.java:localizeResource(954)) - Looks like another thread is writing the same file will wait.
2015-06-19 01:23:52,433 INFO [Thread-21]: tez.DagUtils (DagUtils.java:localizeResource(961)) - Number of wait attempts: 5. Wait interval: 5000
2015-06-19 01:24:17,449 ERROR [Thread-21]: tez.DagUtils (DagUtils.java:localizeResource(977)) - Could not find the jar that was being uploaded
2015-06-19 01:24:17,451 ERROR [Thread-21]: exec.Task (TezTask.java:execute(184)) - Failed to execute tez graph.
java.io.IOException: Previous writer likely failed to write hdfs://localhost:9000/tmp/hive/saurabh/_tez_session_dir/85d83746-a48e-419e-a7ca-8c98faf173ea/*. Failing because I am unlikely to write too.
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeResource(DagUtils.java:978)
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.addTempResources(DagUtils.java:859)
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.localizeTempFilesFromConf(DagUtils.java:802)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.refreshLocalResourcesFromConf(TezSessionState.java:228)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:154)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.updateSession(TezTask.java:234)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:75)
2015-06-19 01:24:18,329 ERROR [HiveServer2-Background-Pool: Thread-41]: ql.Driver (SessionState.java:printError(861)) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
2015-06-19 01:24:18,329 INFO [HiveServer2-Background-Pool: Thread-41]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - </PERFLOG method=Driver.execute start=1434657232288 end=1434657258329 duration=26041 from=org.apache.hadoop.hive.ql.Driver>
2015-06-19 01:24:18,329 INFO [HiveServer2-Background-Pool: Thread-41]: log.PerfLogger (PerfLogger.java:PerfLogBegin(121)) - <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
2015-06-19 01:24:18,329 INFO [HiveServer2-Background-Pool: Thread-41]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - </PERFLOG method=releaseLocks start=1434657258329 end=1434657258329 duration=0 from=org.apache.hadoop.hive.ql.Driver>
2015-06-19 01:24:18,333 ERROR [HiveServer2-Background-Pool: Thread-41]: operation.Operation (SQLOperation.java:run(200)) - Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:315)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:147)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:70)
at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:209)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-06-19 01:24:18,342 INFO [HiveServer2-Handler-Pool: Thread-29]: exec.ListSinkOperator (Operator.java:close(595)) - 40 finished. closing...
2015-06-19 01:24:18,343 INFO [HiveServer2-Handler-Pool: Thread-29]: exec.ListSinkOperator (Operator.java:close(613)) - 40 Close done
2015-06-19 01:24:18,393 INFO [HiveServer2-Handler-Pool: Thread-29]: log.PerfLogger (PerfLogger.java:PerfLogBegin(121)) - <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
2015-06-19 01:24:18,394 INFO [HiveServer2-Handler-Pool: Thread-29]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - </PERFLOG method=releaseLocks start=1434657258393 end=1434657258394 duration=1 from=org.apache.hadoop.hive.ql.Driver>
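For reference (the note promised above): the failed localization is the Tez session trying to copy the literal path file:/usr/lib/tez/* into its HDFS session directory; the wildcard is never expanded, as the dest path ending in /* shows. The usual Tez setup instead uploads the Tez tarball to HDFS and points tez.lib.uris at it in tez-site.xml; a minimal sketch, with the HDFS path as an assumption:

<property>
  <name>tez.lib.uris</name>
  <value>hdfs://localhost:9000/apps/tez/tez.tar.gz</value>
</property>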
