Unknown issue in Nutch elastic indexer with nutch REST api - elasticsearch

I was trying to expose Nutch using REST endpoints and ran into an issue in the indexer phase. I'm using the elasticsearch index writer to index docs to ES. I started the server with the $NUTCH_HOME/runtime/deploy/bin/nutch startserver command. While indexing, an unknown exception is thrown.
Error: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
16/10/07 16:01:47 INFO mapreduce.Job: map 100% reduce 0%
16/10/07 16:01:49 INFO mapreduce.Job: Task Id : attempt_1475748314769_0107_r_000000_1, Status : FAILED
Error: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
16/10/07 16:01:53 INFO mapreduce.Job: Task Id : attempt_1475748314769_0107_r_000000_2, Status : FAILED
Error: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
16/10/07 16:01:58 INFO mapreduce.Job: map 100% reduce 100%
16/10/07 16:01:59 INFO mapreduce.Job: Job job_1475748314769_0107 failed with state FAILED due to: Task failed task_1475748314769_0107_r_000000
Job failed as tasks failed. failedMaps:0 failedReduces:1
ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
Failed with exit code 255.
Any help would be appreciated.
PS: After debugging with the stack trace, I think the issue is due to a mismatch in the guava version. I've tried changing the build.xml of the plugins (parse-tika and parsefilter-naivebayes), but it didn't work.

I have found a solution for this issue. It is caused by version incompatibility of the guava dependency. Hadoop ships guava-11.0.2.jar, but the elastic indexer plugin in Nutch requires guava 18.0. That's why it throws an exception when run on distributed Hadoop. So we just need to update the guava version to 18.0 in the Hadoop libs (found at $HADOOP_HOME/share/hadoop/common/lib/).
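For what it's worth, a minimal sketch of that swap, assuming the standard Hadoop layout and that guava 18.0 is fetched from Maven Central; repeat on every node and restart the YARN/MapReduce daemons so tasks pick up the new jar:
cd $HADOOP_HOME/share/hadoop/common/lib/
mv guava-11.0.2.jar guava-11.0.2.jar.bak          # keep the old jar around in case of regressions
wget https://repo1.maven.org/maven2/com/google/guava/guava/18.0/guava-18.0.jar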

Related

Unable to find SASL server implementation?

There's no issue with the Java version.
The mapper phase has begun; if it were a version-related issue, it would have thrown earlier.
It's throwing some SASL exception?
Here are the errors.
The mapper phase has already begun, but it's not able to proceed further, apparently due to SASL?
2018-06-17 11:15:54,420 INFO mapreduce.Job: map 0% reduce 0%
2018-06-17 11:15:54,440 INFO mapreduce.Job: Job job_1529225370089_0093 failed with state FAILED due to: Application application_1529225370089_0093 failed 2 times due to Error launching appattempt_1529225370089_0093_000002. Got exception: org.apache.hadoop.security.AccessControlException: Unable to find SASL server implementation for DIGEST-MD5

S3distcp on local hadoop cluster not working

I am trying to run s3distcp on my local Hadoop pseudo cluster. Executing s3distcp.jar produced the following stack trace. It seems the reducer task is failing, but I am not able to pinpoint what is causing the reducer to fail:
18/02/21 12:14:01 WARN mapred.LocalJobRunner: job_local639263089_0001
java.lang.Exception: java.lang.RuntimeException: Reducer task failed to copy 1 files: file:/home/chirag/workspaces/lzo/data-1518765365022.lzo etc
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:556)
Caused by: java.lang.RuntimeException: Reducer task failed to copy 1 files: file:/home/chirag/workspaces/lzo/data-1518765365022.lzo etc
at com.amazon.external.elasticmapreduce.s3distcp.CopyFilesReducer.close(CopyFilesReducer.java:70)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:250)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:346)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/02/21 12:14:02 INFO mapreduce.Job: Job job_local639263089_0001 running in uber mode : false
18/02/21 12:14:02 INFO mapreduce.Job: map 100% reduce 0%
18/02/21 12:14:02 INFO mapreduce.Job: Job job_local639263089_0001 failed with state FAILED due to: NA
18/02/21 12:14:02 INFO mapreduce.Job: Counters: 35
I'm getting the same error. In my case, I found logs in HDFS /var/log/hadoop-yarn/apps/hadoop/logs related to the MR job that s3-dist-cp kicks off.
hadoop fs -ls /var/log/hadoop-yarn/apps/hadoop/logs
I copied them out to local:
hadoop fs -get /var/log/hadoop-yarn/apps/hadoop/logs/application_nnnnnnnnnnnnn_nnnn/ip-nnn-nn-nn-nnn.ec2.internal_nnnn
Then I examined them in a text editor to find more detailed diagnostics from the reducer phase. In my case I was getting an error back from the S3 service; you might find a different error.
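If digging through the aggregated files by hand is awkward, the YARN CLI can fetch the same logs by application id (the id below is the placeholder from above):
yarn logs -applicationId application_nnnnnnnnnnnnn_nnnn > app-logs.txt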

Hadoop-2.5.1 + Nutch-2.2.1: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

Command: ./crawl /urls /mydir XXXXX 2
When I run this command with Hadoop 2.5.1 and Nutch 2.2.1, I get the following error.
14/10/07 19:58:10 INFO mapreduce.Job: Running job: job_1411692996443_0016
14/10/07 19:58:17 INFO mapreduce.Job: Job job_1411692996443_0016 running in uber mode : false
14/10/07 19:58:17 INFO mapreduce.Job: map 0% reduce 0%
14/10/07 19:58:21 INFO mapreduce.Job: Task Id : attempt_1411692996443_0016_m_000000_0, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
14/10/07 19:58:26 INFO mapreduce.Job: Task Id : attempt_1411692996443_0016_m_000000_1, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
14/10/07 19:58:31 INFO mapreduce.Job: Task Id : attempt_1411692996443_0016_m_000000_2, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
14/10/07 19:58:36 INFO mapreduce.Job: map 100% reduce 0%
14/10/07 19:58:36 INFO mapreduce.Job: Job job_1411692996443_0016 failed with state FAILED due to: Task failed task_1411692996443_0016_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
14/10/07 19:58:36 INFO mapreduce.Job: Counters: 12
Job Counters
Failed map tasks=4
Launched map tasks=4
Other local map tasks=3
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=11785
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=11785
Total vcore-seconds taken by all map tasks=11785
Total megabyte-seconds taken by all map tasks=12067840
Map-Reduce Framework
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
14/10/07 19:58:36 ERROR crawl.InjectorJob: InjectorJob: java.lang.RuntimeException: job failed: name=[/mydir]inject /urls, jobid=job_1411692996443_0016
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:55)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Probably you are using Gora (or something else) compiled against Hadoop 1 (from the Maven repo?). You can download Gora (0.5?) and build it against Hadoop 2.
Perhaps this is just the first problem in a series of problems.
Please let us know about your next steps.
I had a similar error on Nutch 2.x with Hadoop 2.4.0.
Recompile Nutch with Hadoop 2.5.1 dependencies (via ivy) and exclude all Hadoop 1.x dependencies - you can find them in lib - probably hadoop-core.
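As a rough sketch of that check-and-rebuild cycle, assuming a Nutch 2.x source tree built with ant/ivy (the paths and jar names below are assumptions):
ls runtime/local/lib | grep -i hadoop      # a hadoop-core-*.jar here points to a Hadoop 1.x dependency
grep -n hadoop ivy/ivy.xml                 # bump the Hadoop revisions to 2.5.1 and drop hadoop-core
ant clean runtime                          # rebuild so the runtime picks up the Hadoop 2 jars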

Error in configuring object when converting to Tika using Behemoth and MapReduce

I am running the command to convert a Behemoth corpus with Tika using MapReduce, as given in this tutorial.
I am getting the following error when doing so:
13/02/25 14:44:00 INFO mapred.FileInputFormat: Total input paths to process : 1
13/02/25 14:44:01 INFO mapred.JobClient: Running job: job_201302251222_0017
13/02/25 14:44:02 INFO mapred.JobClient: map 0% reduce 0%
13/02/25 14:44:09 INFO mapred.JobClient: Task Id : attempt_201302251222_0017_m_000000_0, Status : FAILED
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
attempt_201302251222_0017_m_000001_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201302251222_0017_m_000001_0: log4j:WARN Please initialize the log4j system properly.
13/02/25 14:44:14 INFO mapred.JobClient: Task Id : attempt_201302251222_0017_m_000001_1, Status : FAILED
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
I am not able to understand the exact problem. What could be the possible reasons? Do I need to copy any jar from Behemoth/Tika to the Hadoop working directory?
I had the same problem.
The procedure described on this page helped me.
After I ran "mvn clean install", the Tika job worked as described in the tutorial.
The tutorial you mentioned is outdated. See the tutorial on the wiki, which is the reference.
The logs don't give any useful information as to what the problem could be, but all you need to get Behemoth working is the job file for each module. If you have Hadoop running on a server, simply use the hadoop command on the job files, or for simplicity use the behemoth script; a rough sketch follows below.
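As a rough illustration only (the module path, driver class name, and HDFS paths below are assumptions, not taken from the tutorial; check your own Behemoth build and the wiki):
mvn clean install                          # builds a *-job.jar for each Behemoth module
hadoop jar tika/target/behemoth-tika-*-job.jar \
  com.digitalpebble.behemoth.tika.TikaDriver \
  -i /user/me/behemoth/corpus -o /user/me/behemoth/tika-out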
BTW the DigitalPebble mailing list would be a better place to ask questions about Behemoth
HTH
Julien

Creation of symlink from job logs to ${hadoop.tmp.dir} failed in hadoop multinode cluster setup

When I run the simple wordcount example on a 3-node Hadoop cluster, I get the following error. I checked all read/write permissions of the necessary folders. This error does not stop the MapReduce job, but all of the workload goes to one machine in the cluster; the other two machines give the same error whenever a task arrives on them.
12/09/13 09:38:37 INFO mapred.JobClient: Task Id : attempt_201209121718_0006_m_000008_0,Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Creation of symlink from /hadoop/libexec/../logs/userlogs/job_201209121718_0006/attempt_201209121718_0006_m_000008_0 to /hadoop/hadoop-datastore/mapred/local/userlogs/job_201209121718_0006/attempt_201209121718_0006_m_000008_0 failed.
at org.apache.hadoop.mapred.TaskLog.createTaskAttemptLogDir(TaskLog.java:110)
at org.apache.hadoop.mapred.DefaultTaskController.createLogDir(DefaultTaskController.java:71)
at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:316)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:228)
12/09/13 09:38:37 WARN mapred.JobClient: Error reading task outputhttp://peter:50060/tasklog?plaintext=true&attemptid=attempt_201209121718_0006_m_000008_0&filter=stdout
12/09/13 09:38:37 WARN mapred.JobClient: Error reading task outputhttp://peter:50060/tasklog?plaintext=true&attemptid=attempt_201209121718_0006_m_000008_0&filter=stderr
What is that error about?
java.lang.Throwable: Child Error
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
It seems the memory allocated for the task trackers is more than the node's actual memory. Check this link for an explanation; a quick sanity check is sketched below.
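A rough way to verify that, using the Hadoop 1.x property names (the config path and values are assumptions): the configured task slots times the child heap should fit inside the node's physical memory.
grep -A1 mapred.tasktracker.map.tasks.maximum $HADOOP_HOME/conf/mapred-site.xml
grep -A1 mapred.tasktracker.reduce.tasks.maximum $HADOOP_HOME/conf/mapred-site.xml
grep -A1 mapred.child.java.opts $HADOOP_HOME/conf/mapred-site.xml   # e.g. -Xmx512m per task
free -m   # (map slots + reduce slots) * child heap must stay below the total shown here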
