Unable to copy NCDC Data from Amazon AWS to Hadoop Cluster

I am trying to copy the NCDC data from Amazon S3 to my local Hadoop cluster using the following command:
hadoop distcp -Dfs.s3n.awsAccessKeyId='ABC' -Dfs.s3n.awsSecretAccessKey='XYZ' s3n://hadoopbook/ncdc/all input/ncdc/all
It fails with the error below:
java.lang.IllegalArgumentException: AWS Secret Access Key must be specified as the password of a s3n URL, or by setting the fs.s3n.awsSecretAccessKey property
I have gone through the following question, but it was not much help:
Problem with Copying Local Data
Any hint on how to solve this problem? A detailed answer would be much appreciated. Thanks.

Have you tried this:
Excerpt from the AmazonS3 wiki:
Here is an example copying a nutch segment named 0070206153839-1998 at /user/nutch in HDFS to an S3 bucket named 'nutch' (let the S3 AWS_ACCESS_KEY_ID be 123 and the S3 AWS_ACCESS_KEY_SECRET be 456):
% ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998 s3://123:456@nutch/
In your case, it should be something like this:
hadoop distcp s3n://ABC:XYZ@hadoopbook/ncdc/all hdfs://IPaddress:port/input/ncdc/all

You need to set up the AWS access key ID and secret key in core-site.xml:
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>xxxxxxx</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>xxxxxxxxx</value>
</property>
and restart your cluster.
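Once those properties are in place and the cluster has restarted, a quick sanity check before rerunning distcp is to list the source bucket from the question:
hadoop fs -ls s3n://hadoopbook/ncdc/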

Related

Hadoop can list s3 contents but spark-shell throws ClassNotFoundException

My saga continues -
In short, I'm trying to create a test stack for Spark, the aim being to read a file from an S3 bucket and then write it to another. Windows environment.
I was repeatedly encountering errors when trying to access s3 or s3n because a ClassNotFoundException was being thrown. The implementation classes are declared in core-site.xml as fs.s3.impl and fs.s3n.impl.
I added hadoop/share/tools/lib to the classpath to no avail. I then added the aws-java-sdk and hadoop-aws jars to the share/hadoop/common folder, and I am now able to list the contents of a bucket using hadoop on the command line.
hadoop fs -ls "s3n://bucket" shows me the contents, which is great news :)
In my mind the Hadoop configuration should be picked up by Spark, so solving one should solve the other. However, when I run spark-shell and try to save a file to S3, I get the usual ClassNotFoundException, shown below.
I'm still quite new to this and unsure if I've missed something obvious; hopefully someone can help me solve the riddle. Any help is greatly appreciated, thanks.
The exception:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
My core-site.xml (which I believe to be correct now, as hadoop can access S3):
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
  <description>The FileSystem for s3n: (Native S3) uris.</description>
</property>
And finally the hadoop-env.cmd showing the classpath (which is seemingly ignored):
set HADOOP_CONF_DIR=C:\Spark\hadoop\etc\hadoop
#rem ##added as s3 filesystem not found.http://stackoverflow.com/questions/28029134/how-can-i-access-s3-s3n-from-a-local-hadoop-2-6-installation
set HADOOP_USER_CLASSPATH_FIRST=true
set HADOOP_CLASSPATH=%HADOOP_CLASSPATH%:%HADOOP_HOME%\share\hadoop\tools\lib\*
#rem Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
if exist %HADOOP_HOME%\contrib\capacity-scheduler (
  if not defined HADOOP_CLASSPATH (
    set HADOOP_CLASSPATH=%HADOOP_HOME%\contrib\capacity-scheduler\*.jar
  ) else (
    set HADOOP_CLASSPATH=%HADOOP_CLASSPATH%;%HADOOP_HOME%\contrib\capacity-scheduler\*.jar
  )
)
EDIT: spark-defaults.conf
spark.driver.extraClassPath=C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar:C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar
spark.executor.extraClassPath=C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar:C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar
You need to pass some parameters to your spark-shell. Try the --packages org.apache.hadoop:hadoop-aws:2.7.2 flag.
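For example, a minimal invocation (adjust the version to match your Hadoop installation, e.g. 2.7.1 as used above):
spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.2
This resolves the hadoop-aws module and its AWS SDK dependency and puts them on both the driver and executor classpaths, so the extraClassPath entries in spark-defaults.conf should no longer be necessary.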

Access hdfs from outside the cluster

I have a Hadoop cluster on AWS and I am trying to access it from outside the cluster through a Hadoop client. I can successfully run hdfs dfs -ls and see all the contents, but when I try to put or get a file I get this error:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.fs.FsShell.displayError(FsShell.java:304)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:289)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
I have Hadoop 2.6.0 installed on both my cluster and my local machine. I have copied the cluster's conf files to the local machine and have these options in hdfs-site.xml (along with some other options).
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
<property>
  <name>dfs.permissions.enable</name>
  <value>false</value>
</property>
My core-site.xml contains a single property in both the cluster and the client:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://public-dns:9000</value>
  <description>NameNode URI</description>
</property>
I found similar questions but wasn't able to find a solution to this.
How about you SSH into that machine?
I know this is a very bad idea, but to get the work done you can first copy the file onto that machine using scp, then SSH into the cluster/master and run hdfs dfs -put on the copied local file.
You can also automate this via a script (see the sketch below), but again, this is just to get the work done for now.
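A minimal sketch of that workaround (host, user, and paths are placeholders):
scp ./myfile.txt user@master-node:/tmp/myfile.txt
ssh user@master-node 'hdfs dfs -put /tmp/myfile.txt /user/hadoop/'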
Wait for someone else to answer to know the proper way!
I had a similar issue with my cluster when running hadoop fs -get, and I was able to resolve it. Check whether all your datanodes are resolvable by FQDN (Fully Qualified Domain Name) from your local host. In my case the nc command succeeded when using the datanodes' IP addresses, but not their hostnames.
Run the command below:
for i in $(cat /<host list file>); do nc -vz $i 50010; done
50010 is the default datanode port.
When you run any hadoop command, it tries to connect to the datanodes using their FQDNs, and that's where it throws this weird NPE.
Set the export below and then run your hadoop command:
export HADOOP_ROOT_LOGGER=DEBUG,console
You will see that the NPE occurs when it is trying to connect to a datanode for data transfer.
I had Java code that was doing the equivalent of hadoop fs -get through the APIs, and there the exception was clearer:
java.lang.Exception: java.nio.channels.UnresolvedAddressException
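For reference, a minimal sketch of that kind of API-based get, assuming the cluster's core-site.xml and hdfs-site.xml are on the classpath (the paths are placeholders):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsGet {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Equivalent of "hadoop fs -get /remote/file.txt /tmp/file.txt"
        fs.copyToLocalFile(new Path("/remote/file.txt"), new Path("/tmp/file.txt"));
        fs.close();
    }
}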
Let me know if this helps you.

Run Apache Flink with Amazon S3

Has anyone succeeded in using Apache Flink 0.9 to process data stored on AWS S3? I found that it uses its own S3FileSystem instead of the one from Hadoop, and it doesn't seem to work.
I used the following path: s3://bucket.s3.amazonaws.com/folder
It failed with the following exception:
java.io.IOException: Cannot establish connection to Amazon S3:
com.amazonaws.services.s3.model.AmazonS3Exception: The request signature we calculated does not match the signature you provided. Check your key and signing method. (Service: Amazon S3; Status Code: 403;
Update May 2016: The Flink documentation now has a page on how to use Flink with AWS
The question has been asked on the Flink user mailing list as well and I've answered it over there: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Processing-S3-data-with-Apache-Flink-td3046.html
tl;dr:
Flink program
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class S3FileSystem {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment ee = ExecutionEnvironment.createLocalEnvironment();
        DataSet<String> myLines = ee.readTextFile("s3n://my-bucket-name/some-test-file.xml");
        myLines.print();
    }
}
Add the following to core-site.xml and make it available to Flink:
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>putKeyHere</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>putSecretHere</value>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
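One way to make that core-site.xml visible to Flink (an assumption; check the configuration keys documented for your Flink version) is to point fs.hdfs.hadoopconf at the directory containing it in flink-conf.yaml:
# flink-conf.yaml (the path is a placeholder)
fs.hdfs.hadoopconf: /path/to/hadoop/conf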
You can retrieve the artifacts from the S3 bucket that is specified in the output section of the CloudFormation template.
That is, after the Flink runtime is up and running, the taxi stream processor program can be submitted to it to start the real-time analysis of the trip events in the Amazon Kinesis stream.
$ aws s3 cp s3://«artifact-bucket»/artifacts/flink-taxi-stream-processor-1.0.jar .
$ flink run -p 8 flink-taxi-stream-processor-1.0.jar --region «AWS region» --stream «Kinesis stream name» --es-endpoint https://«Elasticsearch endpoint»
Both of the above commands use Amazon S3 as the source; you have to specify the artifact name accordingly.
Note: you can follow the link below to build a pipeline using EMR and S3 buckets.
https://aws.amazon.com/blogs/big-data/build-a-real-time-stream-processing-pipeline-with-apache-flink-on-aws/

Which core-site.xml do I add my AWS access keys to?

I want to run Spark code on EC2 against data stored in my S3 bucket. According to both the Spark EC2 documentation and the Amazon S3 documentation, I have to add my AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to the core-site.xml file. However, when I shell into my master EC2 node, I see several core-site.xml files.
$ find . -name core-site.xml
./mapreduce/conf/core-site.xml
./persistent-hdfs/share/hadoop/templates/conf/core-site.xml
./persistent-hdfs/src/packages/templates/conf/core-site.xml
./persistent-hdfs/src/contrib/test/core-site.xml
./persistent-hdfs/src/test/core-site.xml
./persistent-hdfs/src/c++/libhdfs/tests/conf/core-site.xml
./persistent-hdfs/conf/core-site.xml
./ephemeral-hdfs/share/hadoop/templates/conf/core-site.xml
./ephemeral-hdfs/src/packages/templates/conf/core-site.xml
./ephemeral-hdfs/src/contrib/test/core-site.xml
./ephemeral-hdfs/src/test/core-site.xml
./ephemeral-hdfs/src/c++/libhdfs/tests/conf/core-site.xml
./ephemeral-hdfs/conf/core-site.xml
./spark-ec2/templates/root/mapreduce/conf/core-site.xml
./spark-ec2/templates/root/persistent-hdfs/conf/core-site.xml
./spark-ec2/templates/root/ephemeral-hdfs/conf/core-site.xml
./spark-ec2/templates/root/spark/conf/core-site.xml
./spark/conf/core-site.xml
After some experimentation, I determined that I can access an s3n url like s3n://mcneill-scratch/GR.txt from Spark only if I add my credentials to both mapreduce/conf/core-site.xml and spark/conf/core-site.xml.
This seems wrong to me. It's not DRY, and I can't find anything in the documentation that says you have to add your credentials to multiple files.
Is modifying multiple files the correct way to set S3 credentials via core-site.xml? Is there documentation somewhere that explains this?
./spark/conf/core-site.xml should be the right place.
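If you want to avoid editing core-site.xml files altogether, another option (a sketch, not the documented spark-ec2 approach; the key and secret values are placeholders) is to set the s3n properties programmatically on the Hadoop configuration that Spark exposes:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class S3Access {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("s3n-credentials-example");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Set the s3n credentials on Spark's runtime Hadoop configuration
        // instead of editing any core-site.xml (placeholder values).
        sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
        sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");
        // Same file as in the question
        JavaRDD<String> lines = sc.textFile("s3n://mcneill-scratch/GR.txt");
        System.out.println(lines.count());
        sc.stop();
    }
}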

hadoop hdfs points to file:/// not hdfs://

So I installed Hadoop via Cloudera Manager cdh3u5 on CentOS 5. When I run the command
hadoop fs -ls /
I expected to see the contents of hdfs://localhost.localdomain:8020/
However, it returned the contents of file:///
That said, I can access HDFS through
hadoop fs -ls hdfs://localhost.localdomain:8020/
But when it came to installing other applications such as Accumulo, Accumulo would automatically detect the Hadoop filesystem as file:///
The question is, has anyone run into this issue, and how did you resolve it?
I had a look at HDFS thrift server returns content of local FS, not HDFS, which was a similar issue, but it did not solve this one.
Also, I do not get this issue with Cloudera Manager cdh4.
By default, Hadoop is going to use local mode. You probably need to set fs.default.name to hdfs://localhost.localdomain:8020/ in $HADOOP_HOME/conf/core-site.xml.
To do this, add the following to core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost.localdomain:8020/</value>
</property>
The reason Accumulo is confused is that it's using the same default configuration to figure out where HDFS is... and it's defaulting to file://
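After setting the property and restarting the services, the original command should list HDFS rather than the local filesystem:
hadoop fs -ls /
It should now show the contents of hdfs://localhost.localdomain:8020/ instead of file:///.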
You should specify the datanode data directory and the namenode metadata directory:
dfs.name.dir,
dfs.namenode.name.dir,
dfs.data.dir,
dfs.datanode.data.dir,
fs.default.name
in the core-site.xml file, and then format the name node.
To format the HDFS NameNode:
hadoop namenode -format
Enter 'Yes' to confirm formatting the name node. Restart the HDFS service and deploy the client configuration to access HDFS.
If you have already done the above steps, ensure the client configuration is deployed correctly and that it points to the actual cluster endpoints.
