DistCp from Local Hadoop to Amazon S3

I'm trying to use distcp to copy a folder from my local Hadoop cluster (CDH4) to my Amazon S3 bucket.
I use the following command:
hadoop distcp -log /tmp/distcplog-s3/ hdfs://nameserv1/tmp/data/sampledata s3n://hdfsbackup/
hdfsbackup is the name of my Amazon S3 bucket.
DistCp fails with an UnknownHostException:
13/05/31 11:22:33 INFO tools.DistCp: srcPaths=[hdfs://nameserv1/tmp/data/sampledata]
13/05/31 11:22:33 INFO tools.DistCp: destPath=s3n://hdfsbackup/
No encryption was performed by peer.
No encryption was performed by peer.
13/05/31 11:22:35 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 54 for hadoopuser on ha-hdfs:nameserv1
13/05/31 11:22:35 INFO security.TokenCache: Got dt for hdfs://nameserv1; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameserv1, Ident: (HDFS_DELEGATION_TOKEN token 54 for hadoopuser)
No encryption was performed by peer.
java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfsbackup
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)
at org.apache.hadoop.security.SecurityUtil.buildDTServiceName(SecurityUtil.java:295)
at org.apache.hadoop.fs.FileSystem.getCanonicalServiceName(FileSystem.java:282)
at org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:503)
at org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:487)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:130)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:111)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:85)
at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1046)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:666)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
Caused by: java.net.UnknownHostException: hdfsbackup
... 14 more
I have the AWS ID/Secret configured in the core-site.xml of all nodes.
<!-- Amazon S3 -->
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>MY-ID</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>MY-SECRET</value>
</property>
<!-- Amazon S3N -->
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>MY-ID</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>MY-SECRET</value>
</property>
I'm able to copy files from HDFS using the cp command without any problem. The command below successfully copied the HDFS folder to S3:
hadoop fs -cp hdfs://nameserv1/tmp/data/sampledata s3n://hdfsbackup/
I know there is an Amazon S3-optimized distcp (s3distcp) available, but I don't want to use it because it doesn't support the update/overwrite options.
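For reference, the kind of incremental copy I eventually want to run looks roughly like this; the -update behaviour is exactly what I would miss with s3distcp:
# Intended invocation once the basic copy works; -update only copies files that
# are missing or changed on the destination.
hadoop distcp -update -log /tmp/distcplog-s3/ hdfs://nameserv1/tmp/data/sampledata s3n://hdfsbackup/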

It looks like you are using Kerberos security, and unfortunately Map/Reduce jobs currently cannot access Amazon S3 when Kerberos is enabled. You can see more details in MAPREDUCE-4548.
There is actually a patch that should fix it, but it is not currently part of any Hadoop distribution, so if you have an opportunity to modify and build Hadoop from source, here is what you should do:
Index: core/org/apache/hadoop/security/SecurityUtil.java
===================================================================
--- core/org/apache/hadoop/security/SecurityUtil.java (revision 1305278)
+++ core/org/apache/hadoop/security/SecurityUtil.java (working copy)
@@ -313,6 +313,9 @@
     if (authority == null || authority.isEmpty()) {
       return null;
     }
+    if (uri.getScheme().equals("s3n") || uri.getScheme().equals("s3")) {
+      return null;
+    }
     InetSocketAddress addr = NetUtils.createSocketAddr(authority, defPort);
     return buildTokenService(addr).toString();
   }
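If you do go the build-from-source route, the mechanics would look roughly like the sketch below. Note that the diff above is against the branch-1 source layout; on a CDH4 (Hadoop 2) tree the same three-line change lives in hadoop-common's SecurityUtil.java, so you may need to apply it by hand. The patch file name and source path are placeholders:
# Rough sketch, not a tested recipe; MAPREDUCE-4548.patch and the source path are placeholders.
cd /path/to/hadoop-source
patch -p0 --dry-run < MAPREDUCE-4548.patch   # check that the paths line up first
patch -p0 < MAPREDUCE-4548.patch
# Standard Hadoop 2.x build (see BUILDING.txt in the source tree):
mvn package -Pdist -DskipTests -Dtar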
The ticket was last updated a couple of days ago, so hopefully this will be officially patched soon.
An easier solution would be to just disable Kerberos, but that might not be possible in your environment.
I've seen that you might be able to do this if your bucket is named like a domain name, but I haven't tried it, and even if it works it sounds like a hack.
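If you want to experiment with that last idea anyway, the only difference is the destination URI; the bucket name below is purely illustrative and I haven't verified that it sidesteps the token-service lookup:
# Purely illustrative and untested: a DNS-style bucket name, so the URI authority
# can resolve like a hostname (assuming the name actually resolves in DNS).
hadoop distcp -log /tmp/distcplog-s3/ hdfs://nameserv1/tmp/data/sampledata s3n://hdfsbackup.example.com/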

Related

Facing issue in setting up oozie with secure MapR cluster

We are facing an issue setting up the Oozie service on a secure MapR cluster.
We are using the MapR installer to set up the MapR cluster. Below are the configuration and the steps that we followed.
MapR version - 6.1
OS - Ubuntu 16.04
Authentication - Kerberos
Nodes - Single node
We have enabled MapR security by using the Enable Secure Cluster option in the installer.
Reference doc - https://docs.datafabric.hpe.com/61/AdvancedInstallation/using_enable_secure_cluster_option.html
We have installed Kerberos on the machine.
Reference doc - https://linuxconfig.org/how-to-install-kerberos-kdc-server-and-client-on-ubuntu-18-04
Below are the commands we executed to set up Kerberos authentication for the MapR cluster.
Reference docs -
https://docs.datafabric.hpe.com/61/SecurityGuide/Configuring-Kerberos-User-Authentication.html
https://docs.datafabric.hpe.com/61/SecurityGuide/ConfiguringSPNEGOonMapR.html
sudo kadmin.local
addprinc -randkey mapr/my.cluster.com
ktadd -k /opt/mapr/conf/mapr.keytab mapr/my.cluster.com
addprinc -randkey HTTP/<instance-name>#<realm-name>
ktadd -k /opt/mapr/conf/http.keytab HTTP/<instance-name>#<realm-name>
addprinc -randkey mapr/<instance-name>#<realm-name>
ktadd -k /opt/mapr/conf/mapr2.keytab mapr/<instance-name>#<realm-name>
sudo chown mapr:mapr /opt/mapr/conf/mapr.keytab /opt/mapr/conf/http.keytab /opt/mapr/conf/mapr2.keytab
sudo chmod 777 /opt/mapr/conf/mapr.keytab /opt/mapr/conf/http.keytab /opt/mapr/conf/mapr2.keytab
ktutil
rkt /opt/mapr/conf/mapr.keytab
rkt /opt/mapr/conf/http.keytab
rkt /opt/mapr/conf/mapr2.keytab
wkt /opt/mapr/conf/mapr.keytab
sudo /opt/mapr/server/configure.sh -N my.cluster.com -C <CLDB Node>:7222 -Z <ZookeeperNode>:5181 -K -P "mapr/my.cluster.com#<realm-name>"
Note: the command mentioned in the doc (configure.sh -K -P "<cldbPrincipal>") throws an error, but the command above works.
kinit
maprlogin kerberos
hadoop fs -ls
We are able to access the MapR file system.
We are using the command below to run a simple MapReduce job, and it works fine:
hadoop jar /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0-mapr-1808.jar pi 16 1000
Oozie configuration with kerberos authentication
Reference doc - https://docs.datafabric.hpe.com/61/Oozie/ConfiguringOozieonaSecureCluster.html
We have added the properties below to oozie-site.xml:
<property>
<name>oozie.authentication.type</name>
<value>kerberos</value>
<description>
Defines authentication used for Oozie HTTP endpoint.
Supported values are: simple | kerberos | #AUTHENTICATION_HANDLER_CLASSNAME#
</description>
</property>
<property>
<name>oozie.service.HadoopAccessorService.keytab.file</name>
<value>/opt/mapr/conf/mapr.keytab</value>
<description>
Location of the Oozie user keytab file.
</description>
</property>
<property>
<name>local.realm</name>
<value>{local.realm}</value>
<description>
Kerberos Realm used by Oozie and Hadoop. Using 'local.realm' aligns with Hadoop configuration
</description>
</property>
<property>
<name>oozie.service.HadoopAccessorService.kerberos.principal</name>
<value>mapr/<hostname>#${local.realm}</value>
<description>
Kerberos principal for Oozie service.
</description>
</property>
<property>
<name>oozie.authentication.kerberos.principal</name>
<value>HTTP/<hostname>#${local.realm}</value>
<description>
Indicates the Kerberos principal to be used for the HTTP endpoint. The principal MUST start with 'HTTP/' per the Kerberos HTTP SPNEGO specification.
</description>
</property>
We are checking the Oozie status using the bin/oozie admin -status -auth KERBEROS command, and we are getting the error below:
java.io.IOException: Error while connecting Oozie server. No of retries = 1. Exception = Could not authenticate, Authentication failed, status: 302
Kindly help us resolve this issue.
Oozie is a frigging nightmare in general. Adding Kerberos won't make it easier. Just saying.
The issue you are describing appears to be that some component isn't getting the memo about the Kerberos identity you are using, or doesn't have the access/permissions needed to validate it. This is a common problem, and it typically requires step-by-step interaction to work through what is known and what is not yet known (but is often assumed). I am definitely not an expert on these kinds of issues, however.
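For what it's worth, when I have had to poke at this kind of 302/SPNEGO failure, a first pass looks something like the sketch below; the host and realm are placeholders, and it assumes curl was built with GSSAPI support:
# Rough debugging sketch; <oozie-host>, <instance-name> and <realm-name> are placeholders.
# 1. Is the HTTP principal really in the keytab Oozie points at?
klist -kt /opt/mapr/conf/http.keytab
# 2. Can we actually get a ticket as that principal?
kinit -kt /opt/mapr/conf/http.keytab HTTP/<instance-name>@<realm-name>
# 3. Hit the Oozie admin REST endpoint with SPNEGO and watch the exchange:
curl -v --negotiate -u : "http://<oozie-host>:11000/oozie/v1/admin/status"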
You have a really excellent problem report here which is exactly the sort of thing that the support team can use.
Do you have an active support or partner in place?

Nutch 1.7 with Hadoop 2.6.0 "Wrong FS" Error

We have been trying to use Nutch 1.7 with Hadoop 2.6.0.
After installation, when we try to submit a job to Nutch, we receive the following error:
INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://master:9000/user/ubuntu/crawl/crawldb/436075385, expected: file:///
The job is submitted using the following command:
./crawl urls crawl_results 1
Also, we have checked that the fs.default.name setting in core-site.xml uses the hdfs protocol:
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
It happens when the crawl command is sent to Nutch, after it reads the input URLs from the file and attempts to insert the data into the crawl db.
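A quick way to double-check what the Hadoop client on that box actually resolves (in case the crawl script is picking up a different configuration) would be something like the following Hadoop 2.x commands; this is only a suggested sanity check, not something we have run yet:
# Hypothetical sanity check: which conf directory is on the classpath, and what
# default filesystem does the client resolve? (Expecting hdfs://master:9000, not file:///)
hadoop classpath | tr ':' '\n' | grep -i conf
hdfs getconf -confKey fs.defaultFS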
Any insights would be appreciated.
Thanks in advance.

After enabling Kerberos authentication on CDH5, I am not able to access HDFS

I am trying to implement Kerberos authentication. I am using Hadoop 2.3 on CDH 5.0.1. I have made the following changes:
Added the following properties to core-site.xml:
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
After restarting the daemons, when I issue the hadoop fs -ls / command, I get the following error:
ls: Failed on local exception: java.io.IOException: Server asks us to fall back to SIMPLE auth, but this client is configured to only allow secure connections.; Host Details : local host is: "cldx-xxxx-xxxx/xxx.xx.xx.xx"; destination host is: "cldx-xxxx-xxxx":8020;
Please help me out.
Thanks in advance,
Ankita Singla
There is a lot more to configuring a secure HDFS cluster than just setting hadoop.security.authentication to kerberos. See Configuring Hadoop Security in CDH 5 for the required config settings. You'll need to create appropriate keytab files. Only after you have configured everything, and confirmed that none of the Hadoop services report any errors in their respective logs (namenode and datanode on all hosts, resourcemanager, nodemanager on all nodes, etc.), can you attempt to connect.
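To give a rough flavour of what "appropriate keytab files" means in practice, a sketch for a single hdfs service principal might look like the snippet below; the principal name, realm and paths are placeholders, and the CDH 5 security guide remains the authoritative list of per-service principals and the matching *-site.xml settings:
# Illustrative only; principal, realm and paths are placeholders.
kadmin.local -q "addprinc -randkey hdfs/host1.example.com@EXAMPLE.COM"
kadmin.local -q "xst -k /etc/hadoop/conf/hdfs.keytab hdfs/host1.example.com@EXAMPLE.COM"
chown hdfs:hadoop /etc/hadoop/conf/hdfs.keytab
chmod 400 /etc/hadoop/conf/hdfs.keytab
# Verify the keytab before wiring it into dfs.namenode.keytab.file and friends:
klist -kt /etc/hadoop/conf/hdfs.keytab
kinit -kt /etc/hadoop/conf/hdfs.keytab hdfs/host1.example.com@EXAMPLE.COM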

Issue in Connecting to HDFS Namenode

After a new Hadoop single-node installation, I got the following error in hadoop-root-datanode-localhost.localdomain.log:
2014-06-18 23:43:23,594 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:root cause:java.net.ConnectException: Call to localhost/127.0.0.1:54310 failed on connection exception: java.net.ConnectException: Connection refused
2014-06-18 23:43:23,595 INFO org.apache.hadoop.mapred.JobTracker: Problem connecting to HDFS Namenode... re-trying
java.net.ConnectException: Call to localhost/127.0.0.1:54310 failed on connection exception: java.net.ConnectException: Connection refused
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142)
Any idea?
jps is not giving any output.
core-site.xml is updated:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/surya/hadoop-1.2.1/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
Also, when formatting using hadoop namenode -format, I got the abort error below:
Re-format filesystem in /tmp/hadoop-root/dfs/name ? (Y or N) y
Format aborted in /tmp/hadoop-root/dfs/name
You need to run hadoop namenode -format as the hdfs-superuser. Probably the "hdfs" user itself.
The hint can be seen here:
UserGroupInformation: PriviledgedActionException as:root cause:java
Another thing to consider: you really want to move your HDFS root to something other than /tmp. You risk losing your HDFS contents when /tmp is cleaned (which could happen at any time).
UPDATE based on OP comments.
RE: JobTracker unable to contact NameNode: Please do not skip steps.
First make sure you format the NameNode
Then start the NameNode and DataNodes
Run some basic HDFS commands such as
hdfs dfs -put
and
hdfs dfs -get
Then you can start the JobTracker and TaskTracker
Then (and not earlier) you can try to run some MapReduce job (which uses hdfs)
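For a 1.2.1 tarball install like the one above, that sequence would look roughly like the sketch below; run the format as the HDFS superuser, and treat the paths as examples only:
# Rough sketch for a Hadoop 1.2.1 tarball install; adjust paths and users to your setup.
cd /opt/surya/hadoop-1.2.1
bin/hadoop namenode -format        # answer with a capital Y when prompted
bin/start-dfs.sh                   # starts the NameNode, DataNode(s) and SecondaryNameNode
bin/hadoop fs -put conf/core-site.xml /smoke-test.xml   # basic HDFS smoke test
bin/hadoop fs -get /smoke-test.xml /tmp/smoke-test.xml
bin/start-mapred.sh                # only now start the JobTracker and TaskTrackers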
1) Please run "jps" in console and show what it outputs
2) Please provide core-site.xml (I think you might have wrong fs.default.name)
Concerning this error:
Re-format filesystem in /tmp/hadoop-root/dfs/name ? (Y or N) y
Format aborted in /tmp/hadoop-root/dfs/name
You need to use a capital Y, not a lowercase y in order for it to accept the input and actually do the formatting.

Installing Hadoop on NFS

As a start, I've installed Hadoop (0.15.2) and set up a cluster of 3 nodes: one each for the NameNode, DataNode and JobTracker. All the daemons are up and running. But when I issue any command I get an error. For instance, when I do a copyFromLocal, I get the error shown below under "Error using Hadoop 0.15.2".
Am I missing something?
More details:
I am trying to install Hadoop on an NFS file system. I've installed version 1.0.4 and tried running it, but to no avail. The 1.0.4 version doesn't start the datanode, and the log files for the datanode are empty. Hence I switched back to version 0.15, which at least started all the daemons.
I believe the problem is due to the underlying NFS file system, i.e. all the datanodes and masters using the same files and folders. But I am not sure whether that is actually the case.
But I don't see any reason why I shouldn't be able to run Hadoop on NFS (after appropriately setting the configuration parameters).
Currently I am trying to figure out whether I could set the name and data directories differently for different machines, based on the individual machine names.
Configuration file: (hadoop-site.xml)
<property>
<name>fs.default.name</name>
<value>mumble-12.cs.wisc.edu:9001</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>mumble-13.cs.wisc.edu:9001</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.secondary.info.port</name>
<value>9002</value>
</property>
<property>
<name>dfs.info.port</name>
<value>9003</value>
</property>
<property>
<name>mapred.job.tracker.info.port</name>
<value>9004</value>
</property>
<property>
<name>tasktracker.http.port</name>
<value>9005</value>
</property>
Error using Hadoop 1.0.4 (DataNode doesn't get started):
2013-04-22 18:50:50,438 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 9001, call addBlock(/tmp/hadoop-akshar/mapred/system/jobtracker.info, DFSClient_502734479, null) from 128.105.112.13:37204: error: java.io.IOException: File /tmp/hadoop-akshar/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
java.io.IOException: File /tmp/hadoop-akshar/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
Error using Hadoop 0.15.2:
[akshar#mumble-12] (38)$ bin/hadoop fs -copyFromLocal lib/junit-3.8.1.LICENSE.txt input
13/04/17 03:22:11 WARN fs.DFSClient: Error while writing.
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:189)
at java.net.SocketInputStream.read(SocketInputStream.java:121)
at java.net.SocketInputStream.read(SocketInputStream.java:203)
at java.io.DataInputStream.readShort(DataInputStream.java:312)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1660)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1733)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:55)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:83)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:140)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:826)
at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:120)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1360)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1478)
13/04/17 03:22:12 WARN fs.DFSClient: Error while writing.
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:189)
at java.net.SocketInputStream.read(SocketInputStream.java:121)
at java.net.SocketInputStream.read(SocketInputStream.java:203)
at java.io.DataInputStream.readShort(DataInputStream.java:312)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1660)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1733)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:55)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:83)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:140)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:826)
at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:120)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1360)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1478)
13/04/17 03:22:12 WARN fs.DFSClient: Error while writing.
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:189)
at java.net.SocketInputStream.read(SocketInputStream.java:121)
at java.net.SocketInputStream.read(SocketInputStream.java:203)
at java.io.DataInputStream.readShort(DataInputStream.java:312)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1660)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1733)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:55)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:83)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:140)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:826)
at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:120)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1360)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1478)
copyFromLocal: Connection reset
I was able to get Hadoop to run over NFS using version 1.1.2. It might work for other versions, but I can't guarantee anything.
If you have an NFS file system, then each node should have access to it. fs.default.name tells Hadoop which filesystem URI to use, so it should point to the local disk. I'll assume that your NFS directory is mounted on each node at /nfs.
In core-site.xml you should define:
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/nfs/tmp</value>
</property>
In mapred-site.xml you should define:
<property>
<name>mapred.job.tracker</name>
<value>node1:8021</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/tmp/mapred-local</value>
</property>
Since hadoop.tmp.dir points to the NFS drive, the default locations of mapred.system.dir and mapreduce.jobtracker.staging.root.dir will also be on the NFS drive. It might run if you leave the default value for mapred.local.dir, but that directory is supposed to be on the local filesystem, so to be safe you can put it under /tmp.
You don't have to worry about hdfs-site.xml. That configuration file is used when you start the namenode, but with everything sitting on the NFS drive you shouldn't run HDFS at all.
Now you can run start-mapred.sh on the jobtracker node and run a Hadoop job. Don't run start-all.sh or start-dfs.sh, because those will start HDFS. If you run multiple DataNodes that point to the same NFS directory, one DataNode will lock that directory and the others will shut down because they are unable to obtain a lock.
I tested the configuration with:
bin/hadoop jar hadoop-examples-1.1.2.jar wordcount /nfs/data/test.text /nfs/out
Note that you need to specify full paths to the input and output locations.
I also tried:
bin/hadoop jar hadoop-examples-1.1.2.jar grep /nfs/data/loremIpsum.txt /nfs/out2 lorem
It gave me the same output as when I run it in standalone mode, so I assume it is performing correctly.
Here is more information on fs.default.name:
http://www.greenplum.com/blog/dive-in/usage-and-quirks-of-fs-default-name-in-hadoop-filesystem
