Nutch in Windows: Failed to set permissions of path - windows

I'm trying to use Solr with Nutch on a Windows machine and I'm getting the following error:
Exception in thread "main" java.io.IOException: Failed to set permissions of path: c:\temp\mapred\staging\admin-1654213299\.staging to 0700
From a lot of threads I learned that Hadoop, which seems to be used by Nutch, does some chmod magic that works on Unix machines but not on Windows.
This problem has existed for more than a year now. I found one thread where the offending code line is shown and a fix is proposed. Am I really the only one who has this problem? Is everyone else creating a custom build in order to run Nutch on Windows? Or is there some option to disable the Hadoop stuff, or another solution? Maybe a crawler other than Nutch?
Here's the stack trace of what I'm doing:
admin#WIN-G1BPD00JH42 /cygdrive/c/solr/apache-nutch-1.6
$ bin/nutch crawl urls -dir crawl -depth 3 -topN 5 -solr http://localhost:8080/solr-4.1.0
cygpath: can't convert empty path
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=http://localhost:8080/solr-4.1.0
topN = 5
Injector: starting at 2013-03-03 17:43:15
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Failed to set permissions of path: c:\temp\mapred\staging\admin-1654213299\.staging to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

It took me a while to get this working, but here's the solution, which works on Nutch 1.7:
Download Hadoop Core 0.20.2 from the Maven repository.
Replace $NUTCH_HOME/lib/hadoop-core-1.2.0.jar with the downloaded file, renaming it to the same file name.
That should be it.
Explanation
This issue is caused by Hadoop, since it assumes you're running on Unix and abides by Unix file permission rules. The issue was actually resolved in 2011, but Nutch didn't update the Hadoop version it uses. The relevant fixes are here and here.
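For context, the failing check lives in Hadoop's FileUtil (see checkReturnValue in the stack trace above). Roughly paraphrased, not the exact source: setPermission() falls back to java.io.File#setReadable/#setWritable/#setExecutable, which return false for the group/other bits on Windows, and that false return value becomes the IOException:

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.fs.permission.FsPermission;

// Paraphrased sketch of the Hadoop 1.x check, for illustration only.
class PermissionCheckSketch {
    static void checkReturnValue(boolean rv, File p, FsPermission permission)
            throws IOException {
        if (!rv) {
            // On Windows the java.io.File permission setters cannot express the
            // group/other bits, return false, and this exception is the result.
            throw new IOException("Failed to set permissions of path: " + p
                    + " to " + String.format("%04o", permission.toShort()));
        }
    }
}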

We are using Nutch too, but it is not supported for running on Windows; on Cygwin our 1.4 version had problems similar to yours, something related to mapreduce too.
We solved it by using a VM (VirtualBox) with Ubuntu and a shared directory between Windows and Linux, so we can develop and build on Windows and run Nutch (crawling) on Linux.

I have Nutch running on Windows with no custom build. It's been a long time since I used it, though. One thing that took me a while to catch is that you need to run Cygwin as a Windows administrator to get the necessary rights.

I suggest a different approach. Check this link out. It explains how to swallow the error on Windows, and it does not require you to downgrade Hadoop or rebuild Nutch. I tested it on Nutch 2.1, but it applies to other versions as well.
I also made a simple .bat for starting the crawler and indexer, but it is meant for Nutch 2.x and might not be applicable to Nutch 1.x.
For the sake of posterity, the approach entails:
Making a custom LocalFileSystem implementation:
package com.conga.services.hadoop.patch.HADOOP_7682;

import java.io.IOException;

import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class WinLocalFileSystem extends LocalFileSystem {

    public WinLocalFileSystem() {
        super();
        System.err.println("Patch for HADOOP-7682: " +
                "Instantiating workaround file system");
    }

    /**
     * Delegates to <code>super.mkdirs(Path)</code> and separately calls
     * <code>this.setPermission(Path,FsPermission)</code>.
     */
    @Override
    public boolean mkdirs(Path path, FsPermission permission)
            throws IOException {
        boolean result = super.mkdirs(path);
        this.setPermission(path, permission);
        return result;
    }

    /**
     * Ignores the IOException when attempting to set the permission.
     */
    @Override
    public void setPermission(Path path, FsPermission permission)
            throws IOException {
        try {
            super.setPermission(path, permission);
        }
        catch (IOException e) {
            System.err.println("Patch for HADOOP-7682: " +
                    "Ignoring IOException setting permission for path \"" + path +
                    "\": " + e.getMessage());
        }
    }
}
Compiling it and placing the JAR under ${HADOOP_HOME}/lib
And then registering it by modifying ${HADOOP_HOME}/conf/core-site.xml:
<property>
    <name>fs.file.impl</name>
    <value>com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem</value>
    <description>Enables patch for issue HADOOP-7682 on Windows</description>
</property>
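As a quick sanity check (a hypothetical snippet, not part of the original instructions; it assumes the JAR and the modified core-site.xml are both on the classpath), you can ask Hadoop which implementation it resolves for the file:// scheme:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Hypothetical verification snippet: prints the class Hadoop picks for file:// URIs.
public class CheckLocalFs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(URI.create("file:///"), conf);
        // Expect com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem here.
        System.out.println(fs.getClass().getName());
    }
}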

You have to change the project dependencies hadoop-core and hadoop-tools. I'm using version 0.20.2 and it works fine.
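If you are unsure which Hadoop version your project actually resolves after changing the dependencies, a small hedged check (the class name is made up) is to print it at runtime:

import org.apache.hadoop.util.VersionInfo;

// Hypothetical helper: prints the Hadoop version found on the classpath.
public class PrintHadoopVersion {
    public static void main(String[] args) {
        System.out.println(VersionInfo.getVersion());   // should report 0.20.2 after the change
    }
}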

Related

HBaseTestingUtility failing on Windows 10 with UnsatisfiedLinkError

I'm trying to get the HBaseTestingUtility running on Windows 10.
I'm using hbase-client and hbase-testing-util with version 1.4.2.
When running:
HBaseTestingUtility hbaseUtility = new HBaseTestingUtility();
hbaseUtility.startMiniCluster(); //<- error thrown on this line
I get the below error:
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
at org.apache.hadoop.fs.FileUtil.canWrite(FileUtil.java:996)
...
I have downloaded winutils, and have set the following user variables:
hadoop.home.dir=C:\Users\bwatson\apps\hadoop-2.8.3
HADOOP_HOME=C:\Users\bwatson\apps\hadoop-2.8.3
but this does not make a difference.
The official documentation for the HBaseTestingUtility says that Cygwin is needed on Windows, but I cannot install that due to the admin restrictions on my work machine. Is there any other solution?
After some digging, I found a solution in https://stackoverflow.com/a/43484457/729819: I added %HADOOP_HOME%/bin to PATH. Now I get another error, but I will raise another question for that.
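For completeness, a minimal sketch of the setup under those assumptions (the path below is the one from the question; setting hadoop.home.dir only helps Hadoop find winutils.exe, while the UnsatisfiedLinkError itself goes away once hadoop.dll from that same bin directory is on the JVM's library search path, e.g. via PATH or -Djava.library.path):

import org.apache.hadoop.hbase.HBaseTestingUtility;

// Hedged sketch, not a guaranteed fix: point Hadoop at the winutils directory before
// starting the mini cluster. hadoop.dll must still be loadable by the JVM.
public class MiniClusterSmokeTest {
    public static void main(String[] args) throws Exception {
        System.setProperty("hadoop.home.dir", "C:\\Users\\bwatson\\apps\\hadoop-2.8.3");
        HBaseTestingUtility hbaseUtility = new HBaseTestingUtility();
        hbaseUtility.startMiniCluster();
        hbaseUtility.shutdownMiniCluster();
    }
}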

Error with nutch 1.11 : ....org.apache.hadoop.fs.FileStatus.isDirectory()Z

I want to make an application in Java like Google News.
I am building it from scratch and doing the basic setup with Nutch.
I am done with the installation but am getting an error in one command.
Here is a brief overview of the tech I am using:
- Nutch 1.11
- Cygwin
My first command was :
$ bin/nutch
which gives me perfect output.
Then I injected the URLs to crawl:
$ bin/nutch inject crawl/crawldb urls
which created the crawldb folder from the given URLs.
Now I want to generate segments, but the following command gives me this error:
$ bin/nutch generate crawl/crawldb crawl/segments
Generator: starting at 2016-04-14 17:30:29
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20160414173032
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.isDirectory()Z
at org.apache.nutch.util.LockUtil.removeLockFile(LockUtil.java:79)
at org.apache.nutch.crawl.Generator.generate(Generator.java:637)
at org.apache.nutch.crawl.Generator.run(Generator.java:743)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Generator.main(Generator.java:699)
I don't understand the problem. Is there a mismatch between jars, or is there some other problem?
Did you build Nutch yourself or use the packaged version? I've just checked out the 1.11 branch of the Nutch repo and built it, and executing your commands gives the right output with no exception at all. Granted, I tested this on my local system (OS X), which is not Windows/Cygwin, but that shouldn't be a problem.
The 1.11 Nutch branch uses Hadoop 2.4.0. You can check which Hadoop versions are being pulled from the Maven repo by looking at the hadoop-* files in the runtime/local/lib/ folder.
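To illustrate the mismatch (a hedged sketch, not Nutch's code): FileStatus#isDirectory() exists in Hadoop 2.x, while old 1.x hadoop-core jars only expose the deprecated isDir(). If a stale 1.x jar wins on the classpath, a call compiled against 2.x fails at runtime with exactly this NoSuchMethodError:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: this only runs cleanly when a Hadoop 2.x jar is on the classpath.
public class IsDirectoryCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.getLocal(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("."));
        // Resolves against Hadoop 2.x; against an old 1.x jar this line throws
        // java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.isDirectory()Z
        System.out.println(status.isDirectory());
    }
}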

Spring-xd strange too many open files error

I upgraded from spring-xd 1.2.1 to 1.3.0, and have both under /opt on my system. After starting xd in single node (but configured to use Zookeeper), I tried to create another stream (e.g. "time | log"), and spring-xd throws the following exception:
java.io.FileNotFoundException: /opt/spring-xd-1.2.1.RELEASE/xd/config/modules/modules.yml (Too many open files)
I raised the limit with ulimit -n 60000, but it didn't solve the problem. The strange thing is: why does it still point to spring-xd-1.2.1.RELEASE? I started both xd-singlenode and xd-shell under /opt/spring-xd-1.3.1.RELEASE.
EDIT: added the xd-singlenode running process output just to show it's pointing to 1.3.1:
/usr/java/default/bin/java -Dspring.application.name=admin
-Dlogging.config=file:/opt/spring-xd-1.3.0.RELEASE/xd/config//
/xd-singlenode-logback.groovy -Dxd.home=/opt/spring-xd-1.3.0.RELEASE/xd
-Dspring.config.location=file:/opt/spring-xd-1.3.0.RELEASE/xd/config//
-Dxd.config.home=file:/opt
/spring-xd-1.3.0.RELEASE/xd/config//
-Dspring.config.name=servers,application
-Dxd.module.config.location=file:/opt/spring-xd-1.3.0.RELEASE/xd/config//modules/
-Dxd.module.config.name=modules -classpath
/opt/spring-xd-1.3.0.RELEASE/xd/modules/processor/scripts:/opt/spring-xd
-1.3.0.RELEASE/xd/config:/opt/spring-xd-1.3.0.RELEASE/xd/lib/activation-
...
Have you updated your environment variables? Specifically XD_CONFIG_LOCATION, based on the error shown above.

MRUnit test case for Driver

I have written an MRUnit test with the following code:
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "file:///");
conf.set("fs.default.name", "file:///");
conf.set("mapreduce.framework.name", "local");
conf.setInt("mapreduce.task.io.sort.mb", 1);
Path input = new Path("input/ncdc/micro");
Path output = new Path("output");
FileSystem fs = FileSystem.getLocal(conf);
fs.delete(output, true); // delete old output
VisitedItemFlattenDriver driver = new VisitedItemFlattenDriver();
driver.setConf(conf);
int exitCode = driver.run(new String[] {
input.toString(), output.toString(), "false" });
But when I execute the JUnit test case from Eclipse, I'm getting the exception below:
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:441)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:435)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:277)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:344)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286)
at biz.ds.www.preprocess.visiteditem.VisitedItemFlattenDriver.run(VisitedItemFlattenDriver.java:69)
I'm not sure what is causing this error, as I just intend to unit test my class:
public class VisitedItemFlattenDriver extends Configured implements Tool {
...}
I would deeply appreciate it if someone could guide me on how to resolve the error.
I tried a couple of options to resolve the problem and spent many hours doing so.
Firstly, I searched around and found that I should add winutils.exe and the .dll files to hadoop/bin. I tried that step and also set the HADOOP_HOME environment variable.
That resolved the above-mentioned error, but I was then stuck on a different error:
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
It was obvious that the error was due to some compatibility issue. I did some more searching and found that it can be resolved by upgrading the JRE from 32-bit to 64-bit.
I had been using 32-bit JDK 6, so I updated to 64-bit JDK 6. That did not resolve my problem. I also tried to use MiniDFSCluster for MRUnit, but that too gave the same error.
But when I used 64-bit JDK 7 for my code, the problem was resolved and it ran successfully.
Note: I'm using Hadoop version 2.2.0.
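As a small diagnostic (a hedged sketch for Hadoop 2.2.0 on Windows, not part of the original answer), you can print whether the native library loaded and where winutils.exe was resolved before running the driver test:

import org.apache.hadoop.util.NativeCodeLoader;
import org.apache.hadoop.util.Shell;

// Hedged diagnostic sketch: both values should be meaningful once HADOOP_HOME,
// winutils.exe and hadoop.dll are set up correctly.
public class NativeSetupCheck {
    public static void main(String[] args) {
        System.out.println("native code loaded: " + NativeCodeLoader.isNativeCodeLoaded());
        System.out.println("winutils resolved to: " + Shell.WINUTILS);
    }
}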

MapReduce: Passing external jar files using libjars option does not work

My MapReduce program needs external jar files. I am using the
-libjars option to provide those external jar files.
I used the Tool, Configured and ToolRunner utilities provided by Hadoop:
public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new MapReduce(), args);
    System.exit(res);
}

@Override
public int run(String[] args) throws Exception {
    // Configuration processed by ToolRunner
    Configuration conf = getConf();
    Job job = new Job(conf, "MapReduce");
    ....
}
When I tried to run the job:
$ hadoop jar myjob.jar jobClassName -libjars external.jar
it threw the following exception:
12/11/21 16:26:02 INFO mapred.JobClient: Task Id :
attempt_201211211620_0001_m_000000_1, Status : FAILED Error:
java.lang.ClassNotFoundException:
org.joda.time.format.DateTimeFormatterBuilder
I have been trying to resolve it for a while. Nothing seems to work so far. I am using CDH 4.1.1.
It seems it cannot find JodaTime. Open /etc/hbase/hbase-env.sh and add your extra jar to HADOOP_CLASSPATH.
export HADOOP_CLASSPATH="<extra_entries>:$HADOOP_CLASSPATH"
Another idea, less efficient and sometimes not possible, is to copy your required jar to /usr/share/hadoop/lib.
Try invoking the command using the fully qualified absolute file name for the external.jar. Also confirm that the missing class and all of its prerequisite classes are in the external.jar.
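If -libjars keeps misbehaving, one hedged alternative (using the older DistributedCache API that ships with CDH4-era Hadoop; the class name and HDFS path below are made up, and the jar must already have been copied to HDFS) is to put the jar on the task classpath from the driver itself:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hedged sketch, not the asker's code: ship the Joda-Time jar to the tasks via the
// distributed cache instead of relying on -libjars.
public class MapReduceWithCachedJar extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Hypothetical path: first copy the jar, e.g. hadoop fs -put joda-time-2.1.jar /libs/
        DistributedCache.addFileToClassPath(new Path("/libs/joda-time-2.1.jar"), conf);
        Job job = new Job(conf, "MapReduce");
        // ... set mapper, reducer, input and output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MapReduceWithCachedJar(), args));
    }
}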
