AWS Access Key ID failed on hadoop fs -cp command - hadoop

I was trying to run the hadoop fs -cp command but got the following error message:
-cp: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively)
I'm new to Hadoop and S3, so can anyone please advise me on what I should do?
Thanks!

Edit ${HADOOP_HOME}/conf/hadoop-site.xml and add/update the following properties:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>Your AWS ACCESS KEY</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>Your AWS Secret Access Key</value>
</property>
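Depending on the Hadoop version, you can also pass the keys for a one-off command instead of editing the configuration file. A minimal sketch, with placeholder keys, paths, and bucket name:
# the -D options must come before the -cp command
hadoop fs -Dfs.s3.awsAccessKeyId=YOUR_ACCESS_KEY \
          -Dfs.s3.awsSecretAccessKey=YOUR_SECRET_KEY \
          -cp /user/you/file.txt s3://your-bucket/target/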

Related

Unable to copy HDFS data to S3 bucket

I have an issue related to a similar question asked before. I'm unable to copy data from HDFS to an S3 bucket in IBM Cloud.
I use the command: hadoop distcp hdfs://namenode:9000/user/root/data/ s3a://hdfs-backup/
I've added extra properties to the /etc/hadoop/core-site.xml file:
<property>
<name>fs.s3a.access.key</name>
<value>XXX</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>XXX</value>
</property>
<property>
<name>fs.s3a.endpoint</name>
<value>s3.eu-de.cloud-object-storage.appdomain.cloud</value>
</property>
<property>
<name>fs.s3a.multipart.size</name>
<value>104857600</value>
</property>
I receive the following error message:
root@e05ffff9bac9:/etc/hadoop# hadoop distcp hdfs://namenode:9000/user/root/data/ s3a://hdfs-backup/
2021-04-29 13:29:36,723 ERROR tools.DistCp: Invalid arguments:
java.lang.IllegalArgumentException
at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1314)
at java.util.concurrent.ThreadPoolExecutor.<init>(ThreadPoolExecutor.java:1237)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:280)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:240)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:441)
Invalid arguments: null
Connecting to the S3 bucket with the AWS CLI works fine. Thanks in advance for your help!
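A quick way to check whether the s3a settings in core-site.xml are being picked up at all, independently of distcp (assuming the same bucket and endpoint as above):
hadoop fs -ls s3a://hdfs-backup/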

Cannot connect to hive using beeline, user root cannot impersonate anonymous

I'm trying to connect to Hive using beeline (!connect jdbc:hive2://localhost:10000) and I'm being asked for a username and password:
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000:
Enter password for jdbc:hive2://localhost:10000:
As I don't know what username or password I'm supposed to type in, I leave them empty, which causes the error: Error: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: root is not allowed to impersonate anonymous (state=,code=0)
My setup is a single-node Hadoop cluster on Ubuntu.
I can confirm that the services are up and running, both Hadoop and HiveServer2.
The question is: what are the username and password I'm being asked for, and where can I find or set them?
Thanks in advance
You should provide a valid username and password that has privileges to access the HDFS and Hive services (i.e., the user running HiveServer2). For your setup, the user under which Hadoop and Hive are installed would be the superuser.
These credentials will be used by beeline to initiate a connection with HiveServer2.
Also, add these properties to core-site.xml, replacing username with that user:
<property>
<name>hadoop.proxyuser.username.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.username.hosts</name>
<value>*</value>
</property>
Restart services after adding these properties.
Then run beeline with that username as below (-n specifies the user, -u the JDBC URL):
beeline -u jdbc:hive2://localhost:10000 -n username
ref: https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-Impersonation
Alternatively, you can also set the parameter hive.server2.enable.doAs to false to disable user impersonation.
In hive-site.xml, set the parameter hive.server2.enable.doAs to false:
<property>
<name>hive.server2.enable.doAs</name>
<value>FALSE</value>
<description>
Setting this property to true will have HiveServer2 execute
Hive operations as the user making the calls to it.
</description>
</property>
http://mail-archives.apache.org/mod_mbox/hive-user/201602.mbox/%3C54b7754ceb8370b7250bba929369763f%40cloudtechnologypartners.co.uk%3E
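With impersonation disabled, a minimal end-to-end check might look like this (a sketch assuming a manually managed single-node setup; how HiveServer2 is stopped and started may differ in your environment):
# start HiveServer2 again after stopping the old instance, so the change takes effect
hive --service hiveserver2 &
# with doAs=false, queries run as the HiveServer2 user regardless of the -n value
beeline -u jdbc:hive2://localhost:10000 -n $(whoami)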

Tachyon configuration for s3 under filesystem

I am trying to set up Tachyon on S3 filesystem. For HDFS, tachyon has a parameter called TACHYON_UNDERFS_HDFS_IMPL which is set to "org.apache.hadoop.hdfs.DistributedFileSystem". Does anyone know if such a parameter exists for S3? If so, what is its value?
Thanks in advance for any help!
The Hadoop FS type you mentioned (org.apache.hadoop.hdfs.DistributedFileSystem) is just the interface implementation for HDFS; you don't need an equivalent parameter for S3. Instead, Tachyon creates the s3n FileSystem implementation based on the scheme specified in the URI of the remote DFS, which is configured with TACHYON_UNDERFS_ADDRESS.
For Amazon, you will need to specify something like this:
export TACHYON_UNDERFS_ADDRESS=s3n://your_bucket
Note "s3n", not "s3" here.
Additional setup you will need to work with S3 (see also Error in setting up Tachyon on S3 under filesystem and http://tachyon-project.org/Setup-UFS.html):
In ${TACHYON}/bin/tachyon-env.sh, add the key ID and the secret key to TACHYON_JAVA_OPTS:
-Dfs.s3n.awsAccessKeyId=123
-Dfs.s3n.awsSecretAccessKey=456
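In tachyon-env.sh this could look roughly like the following (a sketch; the exact variable handling differs between Tachyon versions, and the key values are placeholders):
export TACHYON_JAVA_OPTS="$TACHYON_JAVA_OPTS -Dfs.s3n.awsAccessKeyId=123 -Dfs.s3n.awsSecretAccessKey=456"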
Publish the extra dependencies required by the s3n Hadoop FileSystem implementation; the versions depend on the version of Hadoop installed. These are: commons-httpclient-* and jets3t-*.
For that, publish the TACHYON_CLASSPATH as mentioned in one of the links above. This can be done by exporting TACHYON_CLASSPATH in ${TACHYON}/libexec/tachyon-config.sh before CLASSPATH is exported:
export TACHYON_CLASSPATH=~/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:~/.m2/repository/net/java/dev/jets3t/jets3t/0.9.0/jets3t-0.9.0.jar
export CLASSPATH="$TACHYON_CONF_DIR/:$TACHYON_JAR:$TACHYON_CLASSPATH":
Start Tachyon cluster:
./bin/tachyon format
./bin/tachyon-start.sh local
Check its availability via the web interface (http://localhost:19999/) and in the logs (${TACHYON}/logs).
Your core-site.xml should contain the following sections to make sure you are integrated with Tachyon (see the Spark reference http://tachyon-project.org/Running-Spark-on-Tachyon.html for configuring it right from Scala):
fs.defaultFS - specify the Tachyon master host-port (below are defaults)
fs.default.name - default name of fs, the same as before
fs.tachyon.impl - Tachyon's hadoop.FileSystem implementation hint
fs.s3n.awsAccessKeyId - Amazon key id
fs.s3n.awsSecretAccessKey - Amazon secret key
<configuration>
<property>
<name>fs.defaultFS</name>
<value>tachyon://localhost:19998</value>
</property>
<property>
<name>fs.default.name</name>
<value>tachyon://localhost:19998</value>
<description>The name of the default file system. A URI
whose scheme and authority determine the
FileSystem implementation.
</description>
</property>
<property>
<name>fs.tachyon.impl</name>
<value>tachyon.hadoop.TFS</value>
</property>
...
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>123</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>345</value>
</property>
...
</configuration>
Refer to any path using the tachyon scheme with the master host and port:
tachyon://master_host:master_port/path
Example with default Tachyon master host-port:
tachyon://localhost:19998/remote_dir/remote_file.csv
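For instance, once fs.tachyon.impl is on the Hadoop classpath, the same paths can be used from the hadoop CLI (a sketch; the directory and file names are just examples):
hadoop fs -ls tachyon://localhost:19998/
hadoop fs -put local_file.csv tachyon://localhost:19998/remote_dir/remote_file.csv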

After changing CDH5 Kerberos Authentication i am not able to access hdfs

I am trying to implement Kerberos authentication. I am using Hadoop 2.3 on CDH 5.0.1. I have made the following changes:
Added the following properties to core-site.xml:
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
After restarting the daemons, when I issue the hadoop fs -ls / command I get the following error:
ls: Failed on local exception: java.io.IOException: Server asks us to fall back to SIMPLE auth, but this client is configured to only allow secure connections.; Host Details : local host is: "cldx-xxxx-xxxx/xxx.xx.xx.xx"; destination host is: "cldx-xxxx-xxxx":8020;
Please help me out.
Thanks in advance,
Ankita Singla
There is a lot more to configuring a secure HDFS cluster than just specifying hadoop.security.authentication as Kerberos. See Configuring Hadoop Security in CDH 5 about the required config settings. You'll need to create appropriate keytab files. Only after you configured everything and you confirmed that none of the Hadoop services report any error in their respective logs (namenode, datanode on all hosts, resourcemanager, nodemanager on all nodes etc) can you attempt to connect.
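Once the principals and keytabs are in place, a basic end-to-end check from a client could look like this (a sketch; the keytab path, principal, and realm below are assumptions for illustration):
# obtain a ticket from the HDFS keytab
kinit -kt /etc/hadoop/conf/hdfs.keytab hdfs/$(hostname -f)@EXAMPLE.COM
# confirm the ticket is present
klist
# this should now authenticate via Kerberos instead of falling back to SIMPLE
hadoop fs -ls /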

Hue File Browser not working

I have installed Hue, but the File Browser in Hue is not working and throws a "Server Error (500)".
Data from error.log:
webhdfs ERROR Failed to determine superuser of WebHdfs at http://namenode:50070/webhdfs/v1: SecurityException: Failed to obtain user group information: org.apache.hadoop.security.authorize.AuthorizationException: User: hue is not allowed to impersonate hue (error 401)
Traceback (most recent call last):
File "/home/hduser/huef/hue/desktop/libs/hadoop/src/hadoop/fs/webhdfs.py", line 108, in superuser
sb = self.stats('/')
File "/home/hduser/huef/hue/desktop/libs/hadoop/src/hadoop/fs/webhdfs.py", line 188, in stats
res = self._stats(path)
File "/home/hduser/huef/hue/desktop/libs/hadoop/src/hadoop/fs/webhdfs.py", line 182, in _stats
raise ex
Note: I have added the following to core-site.xml, and I have enabled WebHDFS:
<property>
<name>hadoop.proxyuser.hue.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hue.groups</name>
<value>*</value>
</property>
Error when I try to access an HDFS file location through Oozie in Hue:
An error occurred: SecurityException: Failed to obtain user group information: org.apache.hadoop.security.authorize.AuthorizationException: User: hue is not allowed to impersonate hduser (error 401)
core-site.xml
<property>
<name>hadoop.proxyuser.hue.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hue.groups</name>
<value>*</value>
</property>
hdfs-site.xml
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
You need to specify hduser as the proxy user:
<property>
<name>hadoop.proxyuser.hduser.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hduser.groups</name>
<value>*</value>
</property>
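After editing core-site.xml, the proxyuser settings have to be reloaded; on a manually managed cluster this can be done without a full restart (assuming you run the commands as the HDFS/YARN superuser):
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration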
BTW: why are you not running Hue as 'hue'?
What user are you logged in as?
I had the same issue. My solution was to create a Hue user called "hdfs" and add the "hue" Linux user to the "hadoop" and "hdfs" Linux user groups.
So now I log in as the "hdfs" user in the Hue web UI.
You may notice it says Failed to obtain user group information.
According to the Hadoop docs, the group info is gathered by invoking the shell command groups $USERNAME (on *nix systems). Therefore, the matching user MUST exist as a Linux user on the HDFS NameNode, where the authentication process occurs.
So the solution is as simple as running, on the NameNode:
useradd hue -g root
I'm deploying HDFS in a Docker container, so I use the group root; it is the same as the user running the NameNode process (which is definitely the superuser).
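A quick way to confirm the mapping on the NameNode side (the user name here is just an example):
id hue            # the Linux account Hadoop will look up
groups hue        # the same lookup Hadoop performs via the shell
hdfs groups hue   # the groups the NameNode actually resolves for that user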
