Elasticsearch indexing fails after successful Nutch crawl

I'm not sure why, but Nutch 1.13 is failing to index the data to ES (v2.3.3). It is crawling fine, but when it comes time to index to ES it's giving me this error message:
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
Right before that it has this:
elastic.bulk.close.timeout : elastic timeout for the last bulk in seconds. (default 600)
I'm not sure whether the timeout has anything to do with the job failing.
I've run Nutch v1.10 many times with no problems, but decided to upgrade now. I never had this error until after upgrading.
EDIT:
After closer inspection of the error message:
Error running:
/home/david/tutorials/nutch/nutch-1.13/runtime/local/bin/nutch index -Delastic.server.url=http://localhost:9300/search-index/ searchcrawl//crawldb -linkdb searchcrawl//linkdb searchcrawl//segments/20170519125546
It seems to be failing there, on that particular segment. What does that mean? I only know the basics of how to use Nutch; I'm by no means an expert. Is it failing on a link?

Until Nutch 1.14 is out, you need to apply this patch https://github.com/apache/nutch/pull/156 and rebuild:
cd apache-nutch-1.13
wget https://raw.githubusercontent.com/apache/nutch/e040ace189aa0379b998c8852a09c1a1a2308d82/src/java/org/apache/nutch/indexer/CleaningJob.java
mv CleaningJob.java src/java/org/apache/nutch/indexer/.
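The "rebuild" step is not shown above; a minimal sketch, assuming the stock Ant build that ships with the Nutch 1.13 source distribution (the target name is an assumption to verify against your build.xml):
ant clean runtime    # regenerates runtime/local so the patched CleaningJob is picked up
The rebuilt bin/nutch under runtime/local/ is the same script used in the failing command from the question.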

Related

OpenNMS UI error - HTTP ERROR: 503

I am facing an issue while accessing the OpenNMS UI; it throws the error below:
HTTP ERROR: 503
Problem accessing /opennms/alarm/detail.htm. Reason:
Service Unavailable
OpenNMS Version: 1.10.10
Let me know if you need any type of log details.
I know this is a frustrating answer, but seriously, upgrade. Version 1.10.10 is simply ancient.
The version numbering for OpenNMS has dropped the "1." so you are basically on version "10" while the current version is "22".
If you want to try to fix this on 1.10, the problem is usually caused by an error in a .properties file. What I would do is stop OpenNMS:
/opt/opennms/bin/opennms stop
and then remove the log files
rm -rf /opt/opennms/logs/*
then restart OpenNMS
/opt/opennms/bin/opennms start
While that won't fix the problem, you can then
cd /opt/opennms/logs
and then grep for ERROR or FATAL log messages. Those should tell you which file has the problem.
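For example, a quick way to surface those entries (a sketch; adjust the path if your OpenNMS install does not live under /opt/opennms):
grep -rlE "ERROR|FATAL" /opt/opennms/logs/    # list which log files contain errors
grep -rE "ERROR|FATAL" /opt/opennms/logs/ | less    # then read the matching messages
The file names that come back usually point at the .properties file that is breaking the web UI.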

LogStash::ConfigurationError: com.mysql.jdbc.Driver not loaded

When I use the logstash-input-jdbc plugin to sync MySQL with my local Elasticsearch,
the errors below appear. I have searched for a long time, but I have not found a solution yet.
./logstash -f ./logstash_jdbc_test/jdbc.conf
Pipeline aborted due to error {:exception=>#,
:backtrace=>[
"/usr/local/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-jdbc-3.0.2/lib/logstash/plugin_mixins/jdbc.rb:156:in `prepare_jdbc_connection'",
"/usr/local/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-jdbc-3.0.2/lib/logstash/inputs/jdbc.rb:167:in `register'",
"/usr/local/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.2-java/lib/logstash/pipeline.rb:330:in `start_inputs'",
"org/jruby/RubyArray.java:1613:in `each'",
"/usr/local/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.2-java/lib/logstash/pipeline.rb:329:in `start_inputs'",
"/usr/local/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.2-java/lib/logstash/pipeline.rb:180:in `start_workers'",
"/usr/local/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.2-java/lib/logstash/pipeline.rb:136:in `run'",
"/usr/local/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.2-java/lib/logstash/agent.rb:465:in `start_pipeline'"
], :level=>:error}
Yesterday, I found the reason.
The reason is:
In my install path /elasticsearch-jdbc-2.3.2.0/lib, the size of mysql-connector-java-5.1.38.jar was zero.
So I downloaded a new mysql-connector-java-5.1.38.jar and copied it to /elasticsearch-jdbc-2.3.2.0/lib.
And then my problem was resolved.
Now I can sync data between MySQL and Elasticsearch quickly.
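For anyone hitting the same symptom, a quick way to check and replace the driver (a sketch; the lib path is the one from this answer and the Maven Central URL is an assumption to double-check):
ls -l /elasticsearch-jdbc-2.3.2.0/lib/mysql-connector-java-5.1.38.jar    # a size of 0 means the driver jar is broken
wget -O /elasticsearch-jdbc-2.3.2.0/lib/mysql-connector-java-5.1.38.jar https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.38/mysql-connector-java-5.1.38.jar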

Error with nutch 1.11 : ....org.apache.hadoop.fs.FileStatus.isDirectory()Z

I want to make an application in Java like Google News.
For that I am starting from scratch and doing the basic setup with Nutch.
I am done with the installation but am getting an error with one command.
Here is a brief overview of the tech I am using:
-nutch 1.11
-Cygwin
My first command was :
$ bin/nutch
which gives me perfect output.
Then I did the URL injection:
$ bin/nutch inject crawl/crawldb urls
which created the crawldb folder and injected the given URLs.
Now I want to generate segments, which gives me the following error:
$ bin/nutch generate crawl/crawldb crawl/segments
Generator: starting at 2016-04-14 17:30:29
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20160414173032
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.isDirectory()Z
at org.apache.nutch.util.LockUtil.removeLockFile(LockUtil.java:79)
at org.apache.nutch.crawl.Generator.generate(Generator.java:637)
at org.apache.nutch.crawl.Generator.run(Generator.java:743)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Generator.main(Generator.java:699)
I am not able to figure out the problem. Is there a mismatch between jars, or is there some other problem?
Did you build Nutch yourself or use the packaged version? I've just checked out the 1.11 branch of the Nutch repo and built it; executing your commands gives the right output with no exception at all. Granted, I've tested this on my local system (OS X), which is not Windows/Cygwin, but this shouldn't be a problem.
The 1.11 Nutch branch uses Hadoop 2.4.0; you can check which versions of Hadoop are being pulled from the Maven repo by looking at the hadoop-* files in the runtime/local/lib/ folder.
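For example, from the top of a Nutch 1.11 checkout built with the default targets (a sketch; the exact jar names are an assumption, so compare against what your build actually produced):
ls runtime/local/lib/ | grep hadoop    # expect the hadoop-* jars to all be at 2.4.0, e.g. hadoop-common-2.4.0.jar
A missing, duplicated, or older Hadoop jar in that listing would explain the NoSuchMethodError on FileStatus.isDirectory().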

Elasticsearch returning 504 when trying to index on Heroku

I've added elasticsearch to my Rails app using Tire as outlined in this Railscast.
I've tried to deploy to Heroku with the Bonsai add-on. After following this tutorial and also using information based on this question, I've tried running this command:
heroku run rake environment tire:import CLASS=Document FORCE=true
(Document is, of course, the name of my model.)
But I keep getting this error message:
Running `rake environment tire:import CLASS=Document FORCE=true` attached to terminal... up, run.4773
[IMPORT] Deleting index 'documents'
[IMPORT] Creating index 'documents' with mapping:
{"document":{"properties":{}}}
[ERROR] There has been an error when creating the index -- Elasticsearch returned:
504 :
What am I doing wrong?
You might not have done anything wrong. Bonsai has been experiencing issues for the past 18 hours. Your 504 error may just be a result of this.
See this tweet: https://twitter.com/bonsaisearch/status/394950014361165824
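If you want to rule out your own configuration while the outage is sorted out, a quick check from the shell (a sketch; BONSAI_URL is the config var the Bonsai add-on normally sets, and your-app-name is a placeholder):
curl -s "$(heroku config:get BONSAI_URL --app your-app-name)"    # a healthy cluster answers with a small JSON status instead of a 504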

HPCC/HDFS Connector

Does anyone know about the HPCC/HDFS connector? We are using both HPCC and Hadoop. There is a utility (the HPCC/HDFS connector) developed by HPCC Systems which allows an HPCC cluster to access HDFS data.
I have installed the connector, but when I run the program to access data from HDFS it gives an error saying libhdfs.so.0 doesn't exist.
I tried to build libhdfs.so using the command
ant compile-libhdfs -Dlibhdfs=1
It gives me the error:
target "compile-libhdfs" does not exist in the project "hadoop"
I used one more command:
ant compile-c++-libhdfs -Dlibhdfs=1
It gives the error:
ivy-download:
[get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar
[get] To: /home/hadoop/hadoop-0.20.203.0/ivy/ivy-2.1.0.jar
[get] Error getting http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar
to /home/hadoop/hadoop-0.20.203.0/ivy/ivy-2.1.0.jar
BUILD FAILED java.net.ConnectException: Connection timed out
Any suggestion will be a great help.
Chhaya, you might not need to build libhdfs.so; depending on how you installed Hadoop, you might already have it.
Check in HADOOP_LOCATION/c++/Linux-<arch>/lib/libhdfs.so, where HADOOP_LOCATION is your Hadoop install location and <arch> is the machine's architecture (i386-32 or amd64-64).
Once you locate the lib, make sure the H2H connector is configured correctly (see page 4 here).
It's just a matter of updating the HADOOP_LOCATION var in the config file:
/opt/HPCCSystems/hdfsconnector.conf
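For example (a sketch; the exact key name and layout of hdfsconnector.conf can vary between connector releases, so check the documentation linked above; the install path below is taken from the Ivy output in the question):
find /home/hadoop/hadoop-0.20.203.0/c++ -name 'libhdfs.so*'    # confirm the library really exists for your architecture
HADOOP_LOCATION=/home/hadoop/hadoop-0.20.203.0    # the line to set in /opt/HPCCSystems/hdfsconnector.conf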
good luck.

Resources