Hadoop can list S3 contents but spark-shell throws ClassNotFoundException

My saga continues -
In short, I'm trying to create a test stack for Spark: the aim is to read a file from one S3 bucket and then write it to another. Windows environment.
I was repeatedly encountering errors when trying to access s3 or s3n, as a ClassNotFoundException was being thrown. The implementation classes are set in core-site.xml as fs.s3.impl and fs.s3n.impl.
I added hadoop/share/tools/lib to the classpath to no avail. I then added the aws-java-sdk and hadoop-aws jars to the share/hadoop/common folder, and I am now able to list the contents of a bucket using Hadoop on the command line.
hadoop fs -ls "s3n://bucket" shows me the contents, which is great news :)
In my mind the Hadoop configuration should be picked up by Spark, so solving one should solve the other. However, when I run spark-shell and try to save a file to S3, I get the usual ClassNotFoundException shown below.
I'm still quite new to this and unsure whether I've missed something obvious; hopefully someone can help me solve the riddle. Any help is greatly appreciated, thanks.
The exception:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
my core-site.xml (which I believe to be correct now, as Hadoop can access S3):
<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
<property>
<name>fs.s3n.impl</name>
<value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
<description>The FileSystem for s3n: (Native S3) uris.</description>
</property>
and finally the hadoop-env.cmd showing the classpath (which is seemingly ignored):
set HADOOP_CONF_DIR=C:\Spark\hadoop\etc\hadoop
#rem ##added as s3 filesystem not found.http://stackoverflow.com/questions/28029134/how-can-i-access-s3-s3n-from-a-local-hadoop-2-6-installation
set HADOOP_USER_CLASSPATH_FIRST=true
set HADOOP_CLASSPATH=%HADOOP_CLASSPATH%:%HADOOP_HOME%\share\hadoop\tools\lib\*
#rem Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
if exist %HADOOP_HOME%\contrib\capacity-scheduler (
if not defined HADOOP_CLASSPATH (
set HADOOP_CLASSPATH=%HADOOP_HOME%\contrib\capacity-scheduler\*.jar
) else (
set HADOOP_CLASSPATH=%HADOOP_CLASSPATH%;%HADOOP_HOME%\contrib\capacity-scheduler\*.jar
)
)
EDIT: spark-defaults.conf
spark.driver.extraClassPath=C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar:C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar
spark.executor.extraClassPath=C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar:C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar

You need to pass some parameters to spark-shell. Try this flag: --packages org.apache.hadoop:hadoop-aws:2.7.2
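For example, a spark-shell session could then look roughly like the sketch below (bucket names, paths and credentials are placeholders; the keys could just as well come from your core-site.xml):
// launched with: spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.2
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")     // placeholder
hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY") // placeholder
// read from one bucket and write the result to another
val lines = sc.textFile("s3n://source-bucket/input.txt")
lines.saveAsTextFile("s3n://target-bucket/output")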

Related

How can I access S3/S3n from a local Hadoop 2.6 installation?

I am trying to reproduce an Amazon EMR cluster on my local machine. For that purpose, I have installed the latest stable version of Hadoop as of now - 2.6.0.
Now I would like to access an S3 bucket, as I do inside the EMR cluster.
I have added the aws credentials in core-site.xml:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>some id</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>some id</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>some key</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>some key</value>
</property>
Note: Since there are some slashes in the key, I have escaped them with %2F.
If I try to list the contents of the bucket:
hadoop fs -ls s3://some-url/bucket/
I get this error:
ls: No FileSystem for scheme: s3
I edited core-site.xml again, and added information related to the fs:
<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
<property>
<name>fs.s3n.impl</name>
<value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
This time I get a different error:
-ls: Fatal internal error
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
Somehow I suspect the Yarn distribution does not have the necessary jars to be able to read S3, but I have no idea where to get those. Any pointers in this direction would be greatly appreciated.
For some reason, the jar hadoop-aws-[version].jar, which contains the implementation of NativeS3FileSystem, is not present on Hadoop's classpath by default in versions 2.6 and 2.7. So try adding it to the classpath by adding the following line to hadoop-env.sh, which is located at $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
Assuming you are using Apache Hadoop 2.6 or 2.7
By the way, you could check the classpath of Hadoop using:
bin/hadoop classpath
Alternatively, in PySpark you can pull in the AWS connector at runtime and set the credentials on the Hadoop configuration:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
import pyspark
sc = pyspark.SparkContext("local[*]")
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
hadoopConf = sc._jsc.hadoopConfiguration()
myAccessKey = input()
mySecretKey = input()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
df = sqlContext.read.parquet("s3://myBucket/myKey")
Ashrith's answer worked for me with one modification: I had to use $HADOOP_PREFIX rather than $HADOOP_HOME when running v2.6 on Ubuntu. Perhaps this is because $HADOOP_HOME sounds like it is being deprecated?
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${HADOOP_PREFIX}/share/hadoop/tools/lib/*
Having said that, neither worked for me on my Mac with v2.6 installed via Homebrew. In that case, I'm using this extremely kludgy export:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$(brew --prefix hadoop)/libexec/share/hadoop/tools/lib/*
To resolve this issue I tried all of the above, which failed (in my environment, anyway).
However, I was able to get it working by copying the two jars mentioned above from the tools dir into common/lib.
It worked fine after that.
If you are using HDP 2.x or greater you can try modifying the following property in the MapReduce2 configuration settings in Ambari.
mapreduce.application.classpath
Append the following value to the end of the existing string:
/usr/hdp/${hdp.version}/hadoop-mapreduce/*

How do I run sqlline with Phoenix?

When I try to run Phoenix's sqlline.py localhost command, I get
WARN util.DynamicClassLoader: Failed to identify the fs of dir hdfs://localhost:54310/hbase/lib, ignored
java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass...
and nothing else happens. I also could not get Squirrel to work (it freezes when I click 'list drivers').
As per these instructions, I have copied phoenix-4.2.1-server.jar to my hbase/lib folder and restarted hbase. I have also copied core-site.xml and hbase-site.xml to my phoenix/bin directory.
I have not added 'the phoenix-[version]-client.jar to the classpath of any Phoenix client'
since I do not know what this refers to.
I am using HBase 0.98.6.1-hadoop2, Phoenix 4.2.1 and hadoop 2.2.0.
I fixed the same issue by adding the following setting in ${PHOENIX_HOME}/bin/hbase-site.xml:
<property>
<name>fs.hdfs.impl</name>
<value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
</property>

Unable to copy NCDC Data from Amazon AWS to Hadoop Cluster

I am trying to copy the NCDC data from Amazon S3 to my local Hadoop cluster by using the following command.
hadoop distcp -Dfs.s3n.awsAccessKeyId='ABC' -Dfs.s3n.awsSecretAccessKey='XYZ' s3n://hadoopbook/ncdc/all input/ncdc/all
And I get the error given below:
java.lang.IllegalArgumentException: AWS Secret Access Key must be specified as the password of a s3n URL, or by setting the fs.s3n.awsSecretAccessKey property
I have gone through the following question, but it was of no big help.
Problem with Copying Local Data
Any hint about how to solve the problem? A detailed answer will be very much appreciated for better understanding. Thanks.
Have you tried this:
Excerpt from AmazonS3 Wiki
Here is an example copying a nutch segment named 0070206153839-1998 at
/user/nutch in hdfs to an S3 bucket named 'nutch' (Let the S3
AWS_ACCESS_KEY_ID be 123 and the S3 AWS_ACCESS_KEY_SECRET be 456):
% ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998 s3://123:456@nutch/
In your case, it should be something like this:
hadoop distcp s3n://ABC:XYZ@hadoopbook/ncdc/all hdfs://IPaddress:port/input/ncdc/all
You need to set up the AWS ID and secret key in core-site.xml:
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>xxxxxxx</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>xxxxxxxxx</value>
</property>
and restart your cluster

hadoop hdfs points to file:/// not hdfs://

So I installed Hadoop via Cloudera Manager cdh3u5 on CentOS 5. When I run the command
hadoop fs -ls /
I expected to see the contents of hdfs://localhost.localdomain:8020/
However, it returned the contents of file:///
Now, it goes without saying that I can access hdfs:// through
hadoop fs -ls hdfs://localhost.localdomain:8020/
But when it came to installing other applications such as Accumulo, Accumulo would automatically detect the Hadoop filesystem as file:///
The question is: has anyone run into this issue, and how did you resolve it?
I had a look at HDFS thrift server returns content of local FS, not HDFS, which was a similar issue, but it did not solve this one.
Also, I do not get this issue with Cloudera Manager cdh4.
By default, Hadoop is going to use local mode. You probably need to set fs.default.name to hdfs://localhost.localdomain:8020/ in $HADOOP_HOME/conf/core-site.xml.
To do this, add the following to core-site.xml:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost.localdomain:8020/</value>
</property>
The reason Accumulo is confused is that it's using the same default configuration to figure out where HDFS is... and it's defaulting to file:///
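A quick way to check which filesystem a client actually resolves is to ask the Hadoop FileSystem API directly; a minimal sketch in Scala (nothing here is Accumulo-specific, and the same calls exist in Java):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val conf = new Configuration()   // loads core-site.xml from the classpath
val fs = FileSystem.get(conf)
println(fs.getUri)               // file:/// while fs.default.name is unset, hdfs://localhost.localdomain:8020 once it is set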
We should specify the data node data directory and the name node metadata directory:
dfs.name.dir,
dfs.namenode.name.dir,
dfs.data.dir,
dfs.datanode.data.dir,
fs.default.name
(fs.default.name goes in core-site.xml; the dfs.* properties go in hdfs-site.xml), then format the name node.
To format HDFS Name Node:
hadoop namenode -format
Enter 'Yes' to confirm formatting the name node. Restart the HDFS service and deploy the client configuration to access HDFS.
If you have already done the above steps, ensure the client configuration is deployed correctly and that it points to the actual cluster endpoints.

Run a Local file system directory as input of a Mapper in cluster

I gave an input to the mapper from a local filesystem. It runs successfully from Eclipse, but not from the cluster, as it is unable to find the local input path, saying: input path does not exist. Can anybody please help me with how to give a local file path to a mapper so that it can run in the cluster and I can get the output in HDFS?
This is a very old question. I recently faced the same issue.
I am not aware of how correct this solution is, but it worked for me. Please point out any drawbacks it may have. Here's what I did.
Reading a solution from the mail archives, I realised that if I modify fs.default.name from hdfs://localhost:8020/ to file:/// it can access the local file system. However, I didn't want this for all my MapReduce jobs, so I made a copy of core-site.xml in a local system folder (the same one from which I submit my MR jar via hadoop jar).
and in my Driver class for the MR job I added:
Configuration conf = new Configuration();
// load the local copy of core-site.xml and the cluster's hdfs-site.xml
conf.addResource(new Path("/my/local/system/path/to/core-site.xml"));
conf.addResource(new Path("/usr/lib/hadoop-0.20-mapreduce/conf/hdfs-site.xml"));
The MR job takes its input from the local system and writes the output to HDFS.
Running in a cluster requires the data to be loaded into distributed storage (HDFS). Copy the data to HDFS first using hadoop fs -copyFromLocal and then try to run your job again, giving it the path of the data in HDFS.
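If you would rather do that copy from the driver code instead of the shell, the Hadoop FileSystem API offers copyFromLocalFile; a rough sketch in Scala with placeholder paths (the same calls exist in Java):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()   // picks up core-site.xml / hdfs-site.xml from the classpath
val fs = FileSystem.get(conf)    // the cluster filesystem (HDFS) when fs.default.name points at it
// placeholder paths: copy a local file into HDFS before running the job against the HDFS copy
fs.copyFromLocalFile(new Path("file:///local/path/input.txt"), new Path("/user/hadoop/input/input.txt"))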
The question is an interesting one. One can have data on S3 and access this data without an explicit copy to HDFS prior to running the job. In the wordcount example, one would specify this as follows:
hadoop jar example.jar wordcount s3n://bucket/input s3n://bucket/output
What happens here is that the mappers read records directly from S3.
If this can be done with S3, why wouldn't Hadoop work similarly using this syntax instead of s3n:
file:///input file:///output
?
But empirically, this seems to fail in an interesting way: I see that Hadoop gives a file-not-found exception for a file that is indeed in the input directory. That is, it seems to be able to list the files in the input directory on my local disk, but when it comes time to open them to read the records, the file is not found (or not accessible).
The data must be on HDFS for any MapReduce job to process it. So even if you have a source such as a local file system, a network path, or a web-based store (such as Azure Blob Storage or Amazon block storage), you need to copy the data to HDFS first and then run the job.
The bottom line is that you need to push the data to HDFS first, and there are several ways to do so depending on the data source; for example, from the local file system you would use the following command:
$ hadoop fs -copyFromLocal SourceFileOrStoragePath HDFS_Or_directPathAtHDFS
Try setting the input path like this:
FileInputFormat.addInputPath(conf, new Path("file:///the/directory/on/your/local/filesystem"));
If you give the file:// prefix, it can access files from the local system.
I have tried the following code and it solved the problem for me. Please try it and let me know.
You need to get a FileSystem object for the local file system and then use the makeQualified method to return a path. As we need to pass a local filesystem path (there is no other way to pass this to the InputFormat), I've used makeQualified, which indeed returns only a local file system path.
The code is shown below:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Path inputPath = fs.makeQualified(new Path("/usr/local/srini/")); // local path
FileInputFormat.setInputPaths(job, inputPath);
I hope this works for your requirement, though it's posted very late. It worked fine for me, and it does not need any configuration changes, I believe.
You might want to try this by setting the configuration as:
Configuration conf = new Configuration();
conf.set("mapred.job.tracker", "local");
conf.set("fs.default.name", "file:///");
After this you can set the FileInputFormat with the local path and you're good to go.
