Is it possible to run Hadoop jobs (like the WordCount sample) in the local mode on Windows without Cygwin?

I have Windows 7, Java 8, Maven and Eclipse.
I've created a Maven project and used almost exactly the same code as here.
It's just a simple "word count" sample.
When I try to launch the "driver" program from Eclipse, providing command line arguments (the input file and the output directory), I get the following error:
Exception in thread "main" java.lang.NullPointerException at
java.lang.ProcessBuilder.start(ProcessBuilder.java:1012) at
org.apache.hadoop.util.Shell.runCommand(Shell.java:404) at
org.apache.hadoop.util.Shell.run(Shell.java:379) at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) at
org.apache.hadoop.util.Shell.execCommand(Shell.java:678) at
org.apache.hadoop.util.Shell.execCommand(Shell.java:661) at
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639) at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:435) at
org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:277) at
org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125) at
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:344) at
org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268) at
org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at
org.apache.hadoop.mapreduce.Job.submit(Job.java:1265) at
org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286) at
misc.projects.hadoop.exercises.WordCountDriverApp.main(WordCountDriverApp.java:29)
The failing line (WordCountDriverApp.java:29) contains the command to launch the job:
job.waitForCompletion(true)
I want to make it work and therefore I want to understand something:
Do I have to provide any hdfs-site.xml, yarn-site.xml, and so on, if I just want the local mode (without any cluster)?
I don't have these XML config files right now. As far as I remember, the defaults are all OK for local mode, but maybe I am wrong.
Is it possible at all under Windows (to launch any Hadoop jobs whatsoever), or is the whole Hadoop thing Linux-only?
P.S.:
The Hadoop dependency is the following:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
</dependency>

Download Hadoop 2.6.0 or 2.7.1 compiled for Windows
Create HADOOP_HOME environment variable pointing to the unzipped dir
Add %HADOOP_HOME%\bin to PATH env var
Source: https://stackoverflow.com/a/27394808/543836
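If you would rather not touch system-wide environment variables while launching from Eclipse, setting the hadoop.home.dir system property before any Hadoop class loads is a commonly used equivalent of defining HADOOP_HOME. A minimal sketch, assuming the Windows build is unzipped to C:\hadoop (a made-up path) and reusing the driver class from the question:

public class WordCountLauncher {
    public static void main(String[] args) throws Exception {
        // Point Hadoop at the unzipped Windows build (assumed path) so it can
        // find bin\winutils.exe and bin\hadoop.dll at runtime.
        System.setProperty("hadoop.home.dir", "C:\\hadoop");
        // Delegate to the driver from the question.
        misc.projects.hadoop.exercises.WordCountDriverApp.main(args);
    }
}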

Hadoop does run on Windows; it is possible, but you'll grow white hair if you try to pull it off on your own.
To start with, all filesystem operations in Windows Hadoop are routed either through the NativeIO, if available, or via winutils if NativeIO is not loaded. In your case it took the winutils path. You could make NativeIO available if you instruct Eclipse where to find it. See How to add native library to “java.library.path” with Eclipse launch (instead of overriding it): you need to add the location of the hadoop-common-project target's bin, where you'll find hadoop.dll, which hosts the NativeIO. But even after that, you'll still need winutils for container launch. The winutils.exe will be in that same location (the hadoop-common target/bin), but the code looks for it based on %HADOOP_HOME%, so you'll have to define that. And it will go uphill from there.
I intentionally omitted the details of how to configure all this because I don't think you should, or to be more precise, you should only if you understand how to do it.
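If you want to see which of those two paths your run is actually taking, here is a small diagnostic sketch. It only uses public Hadoop 2.x helpers and is meant as an illustration, not a fix:

import org.apache.hadoop.io.nativeio.NativeIO;
import org.apache.hadoop.util.NativeCodeLoader;
import org.apache.hadoop.util.Shell;

public class NativeDiagnostics {
    public static void main(String[] args) {
        System.out.println("Running on Windows:       " + Shell.WINDOWS);
        System.out.println("Native hadoop lib loaded: " + NativeCodeLoader.isNativeCodeLoaded());
        System.out.println("NativeIO available:       " + NativeIO.isAvailable());
        // If NativeIO is not available on Windows, Hadoop falls back to
        // winutils.exe, which it resolves via hadoop.home.dir / HADOOP_HOME.
        System.out.println("hadoop.home.dir property: " + System.getProperty("hadoop.home.dir"));
        System.out.println("HADOOP_HOME env variable: " + System.getenv("HADOOP_HOME"));
    }
}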
It would be much, much easier if you take an off-the-shelf Hadoop distribution for Windows, of which there is exactly one: the HDP from Hortonworks. Download it, install it, configure it and then run against the 'cluster'.

Related

Error in hadoop examples.jar

I just installed Hadoop from the Yahoo developer network, running on a VM. After cd-ing to the bin folder and running start-all.sh, I ran the following command:
hadoop jar hadoop-0.19.0.-examples.jar pi 10 1000000
I'm getting
java.io.IOException: Error opening job jar: hadoop-0.18.0-examples.jar
at org.apache.hadoop.util.RunJar.main(RunJar.java:90) at
org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at
org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Caused by: java.util.ZipException: error in opening zip file
How do I sort this out?
Please make sure that you have the below things in place:
Your examples.jar file is present in the path where you are running the above command; otherwise you need to give the complete path to the jar file.
hadoop jar /usr/lib/hadoop-mapreduce/*example.jar pi 10 100000
It has appropriate read permissions for the user that you are using to run the hadoop job.
If you still face the issue, please add the logs to your question.
You will also face this issue if you are using an older version of Java. Hadoop needs Java 7 or Java 8. Please check your Java version and update if needed.

How to eliminate Error util.Shell: Failed to locate the winutils binary

I am executing a remote job from a Windows machine (the client) under Eclipse. To clarify, I don't have any Hadoop installation on my Windows client, and I don't need one; I am executing the Hadoop job remotely, and Hadoop is installed on a Linux machine.
Everything is executed correctly, but I would like to get rid of this ERROR:
14/09/22 11:49:49 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:363)
at sun.misc.Unsafe.ensureClassInitialized(Native Method)
at sun.reflect.UnsafeFieldAccessorFactory.newFieldAccessor(Unknown Source)
at sun.reflect.ReflectionFactory.newFieldAccessor(Unknown Source)
at java.lang.reflect.Field.acquireFieldAccessor(Unknown Source)
at java.lang.reflect.Field.getFieldAccessor(Unknown Source)
at java.lang.reflect.Field.set(Unknown Source)
at MyFirstJob.main(MyFirstJob.java:45)
Do you know how to stop this exception from happening?
Install winutils.exe; there is no other way of fixing this error.
Here is a little context: Hadoop will write some files locally (e.g. the job configs) before uploading them to the cluster. Thus it will need to set permissions, write some files or create directories.
In case it doesn't find the binary, it will fall back to the Java implementations anyway, so you don't need to worry. However, there is no built-in configuration to turn this message off, so the only way to really fix it is to recompile your hadoop-common jar without this error (I guess installing winutils isn't that bad compared to it).
Copy org.apache.hadoop.util.Shell.java into your project.
You can comment out the below line to remove the error:
throw new IOException("Could not locate executable " + fullExeName + " in the Hadoop binaries.");
Also, for Windows, check:
Error while running Mapreduce (yarn) from windows eclipse
I saw a suggestion somewhere to just create an empty file with that name, to get rid of the error. I think I tried it once and it worked - feel free to try it and see if it works for you. The file can be created on the fly if needed.
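For what it's worth, here is a sketch of that on-the-fly version (untested here; C:\hadoop-dummy and the placeholder class are made up for illustration). It only silences the lookup error; any code path that actually needs to execute winutils will still fail:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WinutilsPlaceholder {
    public static void main(String[] args) throws Exception {
        // Hadoop's Shell class resolves %HADOOP_HOME%\bin\winutils.exe, so mimic that layout.
        Path bin = Paths.get("C:\\hadoop-dummy", "bin");
        Files.createDirectories(bin);
        Path winutils = bin.resolve("winutils.exe");
        if (!Files.exists(winutils)) {
            Files.createFile(winutils); // empty placeholder, never executed
        }
        // Must be set before any Hadoop class triggers Shell's static initializer.
        System.setProperty("hadoop.home.dir", "C:\\hadoop-dummy");
        // ...then run the job client code as before, e.g. MyFirstJob.main(args);
    }
}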

Not able to run Hadoop job remotely

I want to run a hadoop job remotely from a windows machine. The cluster is running on Ubuntu.
Basically, I want to do two things:
Execute the hadoop job remotely.
Retrieve the result from hadoop output directory.
I don't have any idea how to achieve this. I am using hadoop version 1.1.2
I tried passing jobtracker/namenode URL in the Job configuration but it fails.
I have tried the following example : Running java hadoop job on local/remote cluster
Result: I consistently get an error saying it cannot load the directory. It is similar to this post:
Exception while submitting a mapreduce job from remote system
Welcome to a world of pain. I've just implemented this exact use case, but using Hadoop 2.2 (the current stable release) patched and compiled from source.
What I did, in a nutshell, was:
Download the Hadoop 2.2 sources tarball to a Linux machine and decompress it to a temp dir.
Apply these patches which solve the problem of connecting from a Windows client to a Linux server.
Build it from source, using these instructions. It will also ensure that you have 64-bit native libs if you have a 64-bit Linux server. Make sure you fix the build files as the post instructs or the build will fail. Note that after installing protobuf 2.5, you have to run sudo ldconfig; see this post.
Deploy the resulting dist tar from hadoop-2.2.0-src/hadoop-dist/target on the server node(s) and configure it. I can't help you with that since you need to tweak it to your cluster topology.
Install Java on your client Windows machine. Make sure that the path to it has no spaces in it, e.g. c:\java\jdk1.7.
Deploy the same Hadoop dist tar you built on your Windows client. It will contain the crucial fix for the Windows/Linux connection problem.
Compile winutils and Windows native libraries as described in this Stackoverflow answer. It's simpler than building entire Hadoop on Windows.
Set up JAVA_HOME, HADOOP_HOME and PATH environment variables as described in these instructions
Use a text editor or unix2dos (from Cygwin or standalone) to convert the line endings of all .cmd files in the bin and etc\hadoop directories to Windows format, otherwise you'll get weird errors about labels when running them.
Configure the connection properties to your cluster in your config XML files, namely fs.default.name, mapreduce.jobtracker.address, yarn.resourcemanager.hostname and the like (see the sketch after these steps).
Add the rest of the configuration required by the patches from item 2. This is required for the client side only. Otherwise the patch won't work.
If you've managed all of that, you can start your Linux Hadoop cluster and connect to it from your Windows command prompt. Joy!
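For the connection-properties step above, the same keys can also be set programmatically on the client's Configuration if you prefer code over XML. A rough sketch with placeholder hostnames (replace them with your cluster's values; the exact set of keys depends on your Hadoop version and the patches):

import org.apache.hadoop.conf.Configuration;

public class RemoteClusterConf {
    public static Configuration create() {
        Configuration conf = new Configuration();
        // Placeholder hostnames/ports - substitute your cluster's addresses.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.hostname", "resourcemanager.example.com");
        // Older 1.x-style alias, kept for tools that still read it.
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
        return conf;
    }
}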

Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses-submiting job2remoteClustr

I recently upgraded my cluster from Apache Hadoop 1.0 to CDH 4.4.0. I have a WebLogic server on another machine from where I submit jobs to this remote cluster via the mapreduce client. I still want to use MR1 and not YARN. I have compiled my client code against the client jars in the CDH installation (/usr/lib/hadoop/client/*).
I am getting the below error when creating a JobClient instance. There are many posts related to the same issue, but all the solutions refer to the scenario of submitting the job to a local cluster rather than a remote one, and specifically, in my case, from a WLS container.
JobClient jc = new JobClient(conf);
Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
But running from the command prompt on the cluster works perfectly fine.
Appreciate your timely help!
I had a similar error; adding the following jars to the classpath worked for me:
hadoop-mapreduce-client-jobclient-2.2.0.2.0.6.0-76:hadoop-mapreduce-client-shuffle-2.3.0.jar:hadoop-mapreduce-client-common-2.3.0.jar
It's likely that your app is looking at your old Hadoop 1.x configuration files. Maybe your app hard-codes some config? This error tends to indicate you are using the new client libraries but that they are not seeing new-style configuration.
The new-style configuration must exist somewhere, since the command-line tools see it fine. Check your HADOOP_HOME or HADOOP_CONF_DIR env variables too, although that's what the command-line tools tend to pick up, and they work.
Note that you need to install the 'mapreduce' service and not 'yarn' in CDH 4.4 to make it compatible with MR1 clients. See also the '...-mr1-...' artifacts in Maven.
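To rule out the "client is not seeing the new-style configuration" case directly, one hedged option is to load the cluster's client config files into the Configuration explicitly before creating the JobClient. The /etc/hadoop/conf paths below are assumptions about where the client configs were copied on the WebLogic host:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ExplicitConfClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical paths to the cluster's client configs on this machine.
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));
        JobClient jc = new JobClient(new JobConf(conf));
        System.out.println("default maps: " + jc.getDefaultMaps());
    }
}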
In my case, this error was due to the version of the jars; make sure that you are using the same version as on the server.
export HADOOP_MAPRED_HOME=/cloudera/parcels/CDH-4.1.3-1.cdh4.1.3.p0.23/lib/hadoop-0.20-mapreduce
In my case I was running Sqoop 1.4.5 and pointing it to the latest Hadoop 2.0.0-cdh4.4.0, which also had the YARN stuff; that's why it was complaining.
When I pointed Sqoop to hadoop-0.20/2.0.0-cdh4.4.0 (MR1, I think) it worked.
As with Akshay (comment by Setob_b), all I needed was to get hadoop-mapreduce-client-shuffle-.jar on my classpath.
As follows for Maven:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-shuffle</artifactId>
    <version>${hadoop.version}</version>
</dependency>
In my case, strangely, this error was because in my core-site.xml file I used the IP address rather than the hostname.
The moment I used the hostname in place of the IP address in core-site.xml and mapred.xml and re-installed the mapreduce lib files, the error got resolved.
In my case, I resolved this by using hadoop jar instead of java -jar.
It's useful: hadoop will provide the configuration context from hdfs-site.xml, core-site.xml, etc.

Interfacing Hadoop to Cassandra on Amazon AWS - netty version conflict?

I've got a Hadoop map reduce class that runs on Amazon EMR and outputs to an HDFS flatfile. All well and good, but now I need to output to a Cassandra DB, also running on AWS. I built and ran a local client and got that working, then moved the Cassandra writing code to my Hadoop project. The problem, it seems, is that Amazon draws in /home/hadoop/lib/netty-3.2.4.Final.jar for Hadoop 1.0.3, but the Cassandra that runs on AWS is 1.2.6 and uses netty-3.5.9.Final.jar.
What can I do to prevent or circumvent this conflict? Can I draw in my newer version of netty alongside the one Amazon EMR draws in?
The error I get from running the jar on EMR is as follows:
Exception in thread "main" java.lang.NoSuchMethodError: org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.<init>(IIIIIZ)V
at org.apache.cassandra.transport.Frame$Decoder.<init>(Frame.java:147)
at com.datastax.driver.core.Connection$PipelineFactory.getPipeline(Connection.java:616)
at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:212)
at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at com.datastax.driver.core.Connection.<init>(Connection.java:111)
at com.datastax.driver.core.Connection.<init>(Connection.java:56)
at com.datastax.driver.core.Connection$Factory.open(Connection.java:387)
at com.datastax.driver.core.ControlConnection.tryConnect(ControlConnection.java:211)
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:174)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:87)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:609)
at com.datastax.driver.core.Cluster$Manager.access$100(Cluster.java:553)
at com.datastax.driver.core.Cluster.<init>(Cluster.java:67)
at com.datastax.driver.core.Cluster.buildFrom(Cluster.java:94)
at com.datastax.driver.core.Cluster$Builder.build(Cluster.java:534)
at GraphAnalysis.MatrixBuilder.compileOutput(MatrixBuilder.java:282)
at GraphAnalysis.MatrixBuilder.main(MatrixBuilder.java:205)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
I had a similar problem with one of our internal modules, which used netty-3.2.4.jar and was missing a function.
Ultimately I had to do open-heart surgery on the jar file.
Expand the jar file that embeds the old netty library: jar -xvf oldlibrary.jar
Delete the folder org/jboss/netty.
Then expand the new netty jar file into a separate folder.
Copy the new org/jboss/netty folder along with its contents to the old location: cp -prf netty .
Create a new JAR package: jar -cvf ../new_jarfile.jar *
This is a rare case, but you can apply this method to overcome library incompatibilities whenever they occur. Make sure that your program still runs: if you change the underlying library this way, there is a chance that your original program will not run. A well-designed library should be agnostic of such changes, though.
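Before or after that surgery, it can help to confirm which netty jar the conflicting class is actually loaded from at runtime. A small diagnostic sketch:

import org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder;

public class NettyOrigin {
    public static void main(String[] args) {
        // Prints the jar (or directory) the conflicting netty class was loaded from.
        // getCodeSource() can be null for bootstrap classes, but not for an
        // application-classpath jar like netty.
        System.out.println(LengthFieldBasedFrameDecoder.class
                .getProtectionDomain()
                .getCodeSource()
                .getLocation());
    }
}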
