Unable to run the fetcher job in Nutch deploy mode - hadoop

I've successfully run Nutch (v1.4) for a crawl using local mode on my Ubuntu 11.10 system. However, when switching over to "deploy" mode (all else being the same), I get an error during the fetch cycle.
I have Hadoop running succesfully on the machine, in a pseudo-distributed mode (replication factor is 1 and I have just 1 map and 1 reduce job setup). "jps" shows that all Hadoop daemons are up and running.
18920 Jps
14799 DataNode
15127 JobTracker
14554 NameNode
15361 TaskTracker
15044 SecondaryNameNode
I have also added the HADOOP_HOME/bin path to my PATH variable.
PATH=$PATH:/home/jimb/hadoop/bin
Then I ran the crawl from the nutch/deploy directory, as below:
bin/nutch crawl /data/runs/ar/seedurls -dir /data/runs/ar/crawls
Here is the output I get:
12/01/25 13:55:49 INFO crawl.Crawl: crawl started in: /data/runs/ar/crawls
12/01/25 13:55:49 INFO crawl.Crawl: rootUrlDir = /data/runs/ar/seedurls
12/01/25 13:55:49 INFO crawl.Crawl: threads = 10
12/01/25 13:55:49 INFO crawl.Crawl: depth = 5
12/01/25 13:55:49 INFO crawl.Crawl: solrUrl=null
12/01/25 13:55:49 INFO crawl.Injector: Injector: starting at 2012-01-25 13:55:49
12/01/25 13:55:49 INFO crawl.Injector: Injector: crawlDb: /data/runs/ar/crawls/crawldb
12/01/25 13:55:49 INFO crawl.Injector: Injector: urlDir: /data/runs/ar/seedurls
12/01/25 13:55:49 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
12/01/25 13:56:53 INFO mapred.FileInputFormat: Total input paths to process : 1
...
...
12/01/25 13:57:21 INFO crawl.Injector: Injector: Merging injected urls into crawl db.
...
12/01/25 13:57:48 INFO crawl.Injector: Injector: finished at 2012-01-25 13:57:48, elapsed: 00:01:59
12/01/25 13:57:48 INFO crawl.Generator: Generator: starting at 2012-01-25 13:57:48
12/01/25 13:57:48 INFO crawl.Generator: Generator: Selecting best-scoring urls due for fetch.
12/01/25 13:57:48 INFO crawl.Generator: Generator: filtering: true
12/01/25 13:57:48 INFO crawl.Generator: Generator: normalizing: true
12/01/25 13:57:48 INFO mapred.FileInputFormat: Total input paths to process : 2
...
12/01/25 13:58:15 INFO crawl.Generator: Generator: Partitioning selected urls for politeness.
12/01/25 13:58:16 INFO crawl.Generator: Generator: segment: /data/runs/ar/crawls/segments/20120125135816
...
12/01/25 13:58:42 INFO crawl.Generator: Generator: finished at 2012-01-25 13:58:42, elapsed: 00:00:54
12/01/25 13:58:42 ERROR fetcher.Fetcher: Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Now, the configuration files for the "local" mode are setup fine (since a crawl in local mode succeeded). For running in deploy mode, since the "deploy" folder did not have any "conf" subdirectory, I assumed that either:
a) the conf files need to be copied over under "deploy/conf", OR
b) the conf files need to be placed onto HDFS.
I have verified that option (a) above does not help. So, I'm assuming that the Nutch configuration files need to exist in HDFS, for the HDFS fetcher to run successfully? However, I don't know at what path within HDFS I should place these Nutch conf files, or perhaps I'm barking up the wrong tree?
If Nutch reads config files during "deploy" mode from the files under "local/conf", then why is it that the local crawl worked fine, but the deploy-mode crawl isn't?
What am I missing here?
Thanks in advance!

Try this out:
In the nutch source directory, modify the file conf/nutch-site.xml to set http.agent.name properly.
re-build the code using ant
Go to runtime/deploy directory, set the required environment variables and try crawling again.

This is likely because you have not rebuilt yet. Can you run "ant" and see what happens? Obviously, you need to update the http.agent.name in nutch-site.xml if you have not done so yet.

Related

HDFS Data Transfer from One cluster to Another is not working with distcp

I need to transfer HDFS data from one cluster to another. I see "distcp" command to be helpful for this case. But it was not. Both cluster Namenode is privately interconnected with other datanodes. So I have two proxy machines to connect publically with the namenode. Say, I made namenode's 8070 port to run under 20000 in haproxy. Now I can ping both clusters namenode. So, I went for distcp option. There the mapreduce job starts executing for the data transfer but it is not completing.
[hdfs#ip-20-0-42-252 ~]$ hadoop distcp hdfs://YY.YY.YY.YY:20000/user/ce_prasith/filter.txt hdfs://xx.xx.xx.xx:20000/user/gl_qauser
18/10/09 10:12:15 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=false, useRdiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[hdfs:/user/ce_prasith/filter.txt], targetPath=hdfs://xx.xx.xx.xx:20000/user/gl_qauser, targetPathExists=true, filtersFile='null'}
18/10/09 10:12:16 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
18/10/09 10:12:16 INFO tools.SimpleCopyListing: Build file listing completed.
18/10/09 10:12:16 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
18/10/09 10:12:16 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
18/10/09 10:12:16 INFO tools.DistCp: Number of paths in the copy list: 1
18/10/09 10:12:16 INFO tools.DistCp: Number of paths in the copy list: 1
18/10/09 10:12:16 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm97
18/10/09 10:12:16 INFO mapreduce.JobSubmitter: number of splits:1
18/10/09 10:12:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1539063069030_0003
18/10/09 10:12:16 INFO impl.YarnClientImpl: Submitted application application_1539063069030_0003
18/10/09 10:12:17 INFO mapreduce.Job: The url to track the job: http://ip-20-0-21-94.ec2.internal:8088/proxy/application_1539063069030_0003/
18/10/09 10:12:17 INFO tools.DistCp: DistCp job-id: job_1539063069030_0003
18/10/09 10:12:17 INFO mapreduce.Job: Running job: job_1539063069030_0003
18/10/09 10:12:22 INFO mapreduce.Job: Job job_1539063069030_0003 running in uber mode : true
18/10/09 10:12:22 INFO mapreduce.Job: map 0% reduce 0%
18/10/09 10:13:22 INFO mapreduce.Job: map 100% reduce 0%
For your information I have taken few logs of the job
2018-10-09 12:01:42,715 WARN [CommitterEvent Processor #2] org.apache.hadoop.tools.mapred.CopyCommitter: Unable to cleanup temp files
org.apache.hadoop.net.ConnectTimeoutException: Call From ip-YY.YY.YY.YY.ec2.internal/YY.YY.YY.YY to ec2-xx.xx.xx.xx.ap-south-1.compute.amazonaws.com:20000 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ec2-xx.xx.xx.xx.ap-south-1.compute.amazonaws.com/xx.xx.xx.xx:20000]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:750)
at org.apache.hadoop.ipc.Client.call(Client.java:1508)
at org.apache.hadoop.ipc.Client.call(Client.java:1441)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy10.getListing(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:573)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy11.getListing(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2101)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2084)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:731)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:110)
at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:796)
at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:792)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:792)
at org.apache.hadoop.fs.Globber.listStatus(Globber.java:76)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:237)
at org.apache.hadoop.fs.Globber.glob(Globber.java:151)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1714)
at org.apache.hadoop.tools.mapred.CopyCommitter.deleteAttemptTempFiles(CopyCommitter.java:145)
at org.apache.hadoop.tools.mapred.CopyCommitter.cleanupTempFiles(CopyCommitter.java:131)
at org.apache.hadoop.tools.mapred.CopyCommitter.abortJob(CopyCommitter.java:118)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobAbort(CommitterEventHandler.java:298)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:240)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ec2-xx.xx.xx.xx.ap-south-1.compute.amazonaws.com/xx.xx.xx.xx:20000]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:648)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:744)
at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1557)
at org.apache.hadoop.ipc.Client.call(Client.java:1480)
... 31 more
2018-10-09 12:01:42,716 INFO [CommitterEvent Processor #2] org.apache.hadoop.tools.mapred.CopyCommitter: Cleaning up temporary work folder: /user/hdfs/.staging/_distcp1087004350
I get stuck over here. Does somebody have any idea to overcome this?
All nodes in source cluster should see all nodes in destination cluster.

Nutch on windows: ERROR crawl.Injector

I'm trying to install nutch 1.12 on a windows 2012 server based on cygwin64 2.874. Due to limited skills with java and linux I followed the step by step introduction at https://wiki.apache.org/nutch/NutchTutorial#Step-by-Step:_Seeding_the_crawldb_with_a_list_of_URLs. The command
bin/nutch inject crawl/crawldb urls
throws an error because winutils.exe couldn't be found. Here is the hadoop log:
2016-07-01 09:22:25,660 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:432)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:478)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
at org.apache.nutch.crawl.Injector.main(Injector.java:441)
2016-07-01 09:22:25,832 INFO crawl.Injector - Injector: starting at 2016-07-01 09:22:25
2016-07-01 09:22:25,832 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2016-07-01 09:22:25,832 INFO crawl.Injector - Injector: urlDir: urls
2016-07-01 09:22:25,832 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2016-07-01 09:22:26,050 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-07-01 09:22:26,394 ERROR crawl.Injector - Injector: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
at org.apache.nutch.crawl.Injector.run(Injector.java:467)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:441)
I found several hints here and downloaded winutils.exe from https://codeload.github.com/srccodes/hadoop-common-2.2.0-bin/zip/master. I unpacked the folder on the server and set the environment variable NUTCH_OPTS=-Dhadoop.home.dir=[winutils_folder]. Now the winutils error is gone but the nutch call fails with a different error:
2016-07-01 13:24:09,697 INFO crawl.Injector - Injector: starting at 2016-07-01 13:24:09
2016-07-01 13:24:09,697 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2016-07-01 13:24:09,697 INFO crawl.Injector - Injector: urlDir: urls
2016-07-01 13:24:09,697 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2016-07-01 13:24:09,885 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-07-01 13:24:10,307 ERROR crawl.Injector - Injector: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
at org.apache.nutch.crawl.Injector.run(Injector.java:467)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:441)
After updating .bashrc (added the following lines) the hadoop log only shows warnings.
export JAVA_HOME='/cygdrive/c/Program Files/Java/jre1.8.0_92'
export PATH=$PATH:$JAVA_HOME/bin
But nutch still throws an error:
Injector: starting at 2016-07-01 17:25:22
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:570)
at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:173)
at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:160)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:94)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:115)
at org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:131)
at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:163)
at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at org.apache.nutch.crawl.Injector.inject(Injector.java:376)
at org.apache.nutch.crawl.Injector.run(Injector.java:467)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:441)
I need hints what may be wrong configured or isn't it possible to run nutch with windows/cygwin?
I spent about a week to find out what causes UnsatisfiedLinkError.
I think the main reason is that java cannot find its dependencies.
I solved this problem by commenting out some scripts in nutch file.
(located in {nutch_folder}\runtime\local\bin)
To be more specific, I commented out scripts setting -Djava.library.path.
I made java program to use original java.library.path instead.
Use below command to check original java.libary.path on your environment.
java -XshowSettings:properties -version
It will show many properties including java.library.path = ...
comment out below code:
# setup 'java.library.path' for native-hadoop code if necessary
# used only in local mode
JAVA_LIBRARY_PATH=''
if [ -d "${NUTCH_HOME}/lib/native" ]; then
JAVA_PLATFORM=`"${JAVA}" -classpath "$CLASSPATH" org.apache.hadoop.util.PlatformName | sed -e 's/ /_/g'`
if [ -d "${NUTCH_HOME}/lib/native" ]; then
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
JAVA_LIBRARY_PATH="${JAVA_LIBRARY_PATH}:${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}"
else
JAVA_LIBRARY_PATH="${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}"
fi
fi
fi
if [ $cygwin = true -a "X${JAVA_LIBRARY_PATH}" != "X" ]; then
JAVA_LIBRARY_PATH="`cygpath -p -w "$JAVA_LIBRARY_PATH"`"
fi
also comment out below code:
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
NUTCH_OPTS=("${NUTCH_OPTS[#]}" -Djava.library.path="$JAVA_LIBRARY_PATH")
fi
by doing this NUTCH_OPTS will not contain -Djava.library.path, so java will use original setting.
nutch command is running like in (pseudo)-distributed mode.
I don't know why exactly.
Hope this helps.
No need to comment anything in the nutch script.
If you have placed the hadoop.dll and winutils.exe under a bin folder of your HADOOP_HOME make sure that the path to bin folder is also appended to your windows PATH env varable.

How to resolve the following apache pig error?

I am executing the following commands:
A= load 'user/cloudera' using PigStorage(':');
foreach A generate $0,$4,$5;
dump B;
On executing the last command I get the following error which I am unable to resolve.Being a newbie to bigdata and apache hadoop stack,I am unable to comprehend this error.Please help ASAP.Also searching on StackOverflow for similar errors did not help:
2015-11-13 06:36:46,170 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2015-11-13 06:36:46,208 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier, PartitionFilterOptimizer]}
2015-11-13 06:36:46,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2015-11-13 06:36:46,225 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2015-11-13 06:36:46,225 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2015-11-13 06:36:46,404 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2015-11-13 06:36:46,415 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2015-11-13 06:36:46,445 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2015-11-13 06:36:49,232 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job306801006066349255.jar
2015-11-13 06:37:04,185 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job306801006066349255.jar created
2015-11-13 06:37:04,223 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2015-11-13 06:37:04,238 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2015-11-13 06:37:04,238 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cache
2015-11-13 06:37:04,238 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2015-11-13 06:37:04,274 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2015-11-13 06:37:04,274 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-11-13 06:37:04,283 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2015-11-13 06:37:04,363 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-11-13 06:37:05,416 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Cleaning up the staging area /tmp/hadoop-yarn/staging/cloudera/.staging/job_1447417089361_0004
2015-11-13 06:37:05,420 [JobControl] WARN org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:cloudera (auth:SIMPLE) cause:org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera
2015-11-13 06:37:05,420 [JobControl] INFO org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob - PigLatin:DefaultJobName got an error while submitting
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1306)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1303)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1303)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:191)
at java.lang.Thread.run(Thread.java:745)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:270)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
... 18 more
2015-11-13 06:37:05,423 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1447417089361_0004
2015-11-13 06:37:05,423 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases A,B
2015-11-13 06:37:05,423 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: A[3,3],B[4,3] C: R:
2015-11-13 06:37:05,423 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://localhost:50030/jobdetails.jsp?jobid=job_1447417089361_0004
2015-11-13 06:37:05,440 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2015-11-13 06:37:10,463 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2015-11-13 06:37:10,463 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1447417089361_0004 has failed! Stop running all dependent jobs
2015-11-13 06:37:10,463 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-11-13 06:37:10,620 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Could not get Job info from RM for job job_1447417089361_0004. Redirecting to job history server.
2015-11-13 06:37:10,844 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Could not get Job info from RM for job job_1447417089361_0004. Redirecting to job history server.
2015-11-13 06:37:10,849 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2015-11-13 06:37:10,850 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.6.0-cdh5.4.2 0.12.0-cdh5.4.2 cloudera 2015-11-13 06:36:46 2015-11-13 06:37:10 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_1447417089361_0004 A,B MAP_ONLY Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1306)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1303)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1303)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:191)
at java.lang.Thread.run(Thread.java:745)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:270)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
... 18 more
hdfs://quickstart.cloudera:8020/tmp/temp-193566860/tmp-1023933528,
Input(s):
Failed to read data from "hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera"
Output(s):
Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-193566860/tmp-1023933528"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1447417089361_0004
2015-11-13 06:37:10,850 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2015-11-13 06:37:10,853 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias B
Details at logfile: /home/cloudera/pig_1447424730804.log
Solution:
Change line 1 of your code to:
A= load '/user/cloudera' using PigStorage(':');
Having said that, are you sure your input data is in your home area? That seems unlikely. It's more likely to be in a folder within your home area, i.e. /user/cloudera/input-data.
Before running your job do:
hdfs dfs -ls /user/cloudera
to confirm that input data is actually in that folder. If it's not, work out where it actually is, and make sure it is on the HDFS and not locally.
Explanation:
The relevant part of the logs is
ERROR 2118: Input path does not exist: hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera
This suggests it's related to the input path. The part of your code that deals with the input path is
A= load 'user/cloudera' using PigStorage(':');
By not adding a forward slash to /user then it assumes that everything is relative to your home area, so for example writing load 'input' would lead the Pig job to read in hdfs://quickstart.cloudera:8028/user/cloudera/input. In your case then the missing slash means it adds it to your user area.
Please check where are you running pig statements in local/linux mode or mapreduce mode , if you are running local mode you cant read HDFS file
run this cmd $ pig -x mapreduce
after this
Grunt> A = load '/;
Grunt>Dump/store /
Input(s):
Failed to read data from
"hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera"
Output(s):
Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-193566860/tmp-1023933528"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
If you look at the above error code (copied from question that was posted), input file path is taken by pig as below
'hdfs://quickstart.cloudera:8020/user/cloudera/user/cloudera'
So when running pig in mapreduce mode and using quickstart VM, you have use
'hdfs://quickstart.cloudera:8020' before your input file path....
Is the input a folder?...if so pig script will read all the files in the folder....
Example:
say your input file path which is located in hdfs is '/usr/cloudera/inputdata.txt' then your query has to be
A= load 'hdfs://quickstart.cloudera:8020/user/cloudera/inputdata.txt' using PigStorage(':');
foreach A generate $0,$4,$5;
dump B;
The only thing that u have to check is the file path.
hdfs dfs -ls /
check how the path structure comes.
like ..../input/data/file.
file u are going to use.
A = load '/input/data/file';
dump A;
only thing is the proper path of your file that u want to load.
I had a similar issue and I was able to fix by deleting the passwd file from the target folder and then copying it again using the correct relative path:
A= load '/user/cloudera' using PigStorage(':');
The instruction in the video is missing the '/' before the user
Step 1:
hdfs dfs -rm /user/cloudera/passwd
Step 2:
A= load '/user/cloudera' using PigStorage(':');
Then followed the instructions from the Coursera video.

Error when crawling with Nutch - Input path does not exist: hdfs://.../urls/seed.txt

I have installed Apache Nutch and running a crawl using:
bin/crawl ./urls/seed.txt crawl http://localhost:8983/solr/ 5
Works fine from runtime/local. When I run that same command from runtime/deploy I get:
14/07/16 19:43:35 INFO crawl.InjectorJob: InjectorJob: starting at 2014-07-16 19:43:35
14/07/16 19:43:35 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls/seed.txt
14/07/16 19:43:37 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/07/16 19:43:37 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
14/07/16 19:43:37 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as the Gora storage class.
14/07/16 19:43:37 INFO mapred.JobClient: Default number of map tasks: null
14/07/16 19:43:37 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 12
14/07/16 19:43:37 INFO mapred.JobClient: Default number of reduce tasks: 0
14/07/16 19:43:38 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
14/07/16 19:43:38 INFO mapred.JobClient: Setting group to hadoop
14/07/16 19:43:39 INFO mapred.JobClient: Cleaning up the staging area hdfs://172.31.13.61:9000/mnt/var/lib/hadoop/tmp/mapred/staging/hadoop/.staging/job_201407161337_0024
14/07/16 19:43:39 ERROR security.UserGroupInformation: PriviledgedActionException as:hadoop cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://172.31.13.61:9000/user/hadoop/urls/seed.txt
14/07/16 19:43:39 ERROR crawl.InjectorJob: InjectorJob: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://172.31.13.61:9000/user/hadoop/urls/seed.txt
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1016)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1033)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:951)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:904)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1140)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:904)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:501)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:531)
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:50)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
It is not locating the file seed.txt and yes it does exist in $HOME/urls/seed.txt. I am using AWS EMR and Cassandra. Any help would be greatly appreciated.

Pig DUMP gets stuck in GROUP

I'm a PIG beginner (using pig 0.10.0) and I have some simple JSON like the following:
test.json:
{
"from": "1234567890",
.....
"profile": {
"email": "me#domain.com"
.....
}
}
which i perform some group/counting in pig:
>pig -x local
with the following PIG script:
REGISTER /pig-udfs/oink.jar;
REGISTER /pig-udfs/json-simple-1.1.jar;
REGISTER /pig-udfs/guava-12.0.jar;
REGISTER /pig-udfs/elephant-bird-2.2.3.jar;
users = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') as (json:map[]);
domain_user = FOREACH users GENERATE oink.EmailDomainFilter(json#'profile'#'email') as email, json#'from' as user_id;
DUMP domain_user; /* Outputs: (domain.com,1234567890) */
grouped_domain_user = GROUP domain_user BY email;
DUMP grouped_domain_user; /* Outputs: =stuck here= */
Basically, when i try to dump the grouped_domain_user, pig gets stuck, seemly waiting for a map output to complete:
2012-05-31 17:45:22,111 [Thread-15] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local_0002_m_000000_0' done.
2012-05-31 17:45:22,119 [Thread-15] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : null
2012-05-31 17:45:22,123 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - ShuffleRamManager: MemoryLimit=724828160, MaxSingleShuffleLimit=181207040
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging on-disk files
2012-05-31 17:45:22,128 [Thread for merging in memory files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging in memory files
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread waiting: Thread for merging on-disk files
2012-05-31 17:45:22,129 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress
2012-05-31 17:45:22,129 [Thread for polling Map Completion Events] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for polling Map Completion Events
2012-05-31 17:45:22,129 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-05-31 17:45:28,118 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:31,122 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:37,123 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:43,124 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:46,124 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:52,126 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:58,127 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:46:01,128 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
.... repeats ....
Suggestions would be welcome on why this is happening.
Thanks!
UPDATE
Chris solved this one for me. I was setting the fs.default.name, etc to correct values in the pig.properties, however i also had the HADOOP_CONF_DIR environment variable set to point to my local Hadoop installation with these sames values set with <final>true</final>.
Great find and much appreciated.
To mark this question as answered, and to those stumbling across this in future:
When running on local mode (whether that be for pig via the pig -x local, or submitting a map reduce job to the local job runner, if you are seeing the reduce phase 'hang', especially if you see entries in the log similar to:
2012-05-31 17:45:22,129 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask -
attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress
Then your job, although started in local mode, has probably switched to 'clustered' mode because the mapred.job.tracker property is marked as 'final' in your $HADOOP/conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>hdfs://localhost:9000</value>
<final>true</final>
</property>
You should also check the fs.default.name property in core-site.xml too, and ensure it is not marked as final
This means that you are unable to set this value at runtime, and you may even see error messages similar to:
12/05/22 14:28:29 WARN conf.Configuration:
file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: fs.default.name; Ignoring.
12/05/22 14:28:29 WARN conf.Configuration:
file:/tmp/.../job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker; Ignoring.

Resources