HDFS to S3 DistCp - Access Keys - hadoop

To copy a file from HDFS to an S3 bucket I used the command
hadoop distcp -Dfs.s3a.access.key=ACCESS_KEY_HERE \
-Dfs.s3a.secret.key=SECRET_KEY_HERE /path/in/hdfs s3a://BUCKET_NAME
But the access key and secret key are visible here, which is not secure.
Is there any method to provide the credentials from a file?
I don't want to edit the config file, which is one of the methods I came across.

I also faced the same situation, and solved it by getting temporary credentials from the instance metadata service. (In case you're using an IAM user's credentials, please note that the temporary credentials mentioned here come from an IAM role, which is attached to the EC2 server, not to a human; refer to http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html)
I found that only specifying the credentials in the hadoop distcp command will not work.
You also have to specify the config fs.s3a.aws.credentials.provider. (refer to http://hortonworks.github.io/hdp-aws/s3-security/index.html#using-temporary-session-credentials)
The final command will look like the one below:
hadoop distcp -Dfs.s3a.aws.credentials.provider="org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider" -Dfs.s3a.access.key="{AccessKeyId}" -Dfs.s3a.secret.key="{SecretAccessKey}" -Dfs.s3a.session.token="{SessionToken}" s3a://bucket/prefix/file /path/on/hdfs
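For reference, a minimal sketch of how those three placeholders can be filled on the EC2 node itself; this assumes the instance has an IAM role attached and that jq is installed:
# the list endpoint returns the attached role name; the second call returns the JSON shown further below
ROLE=$(curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/)
CREDS=$(curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/$ROLE)
ACCESS_KEY=$(echo "$CREDS" | jq -r .AccessKeyId)
SECRET_KEY=$(echo "$CREDS" | jq -r .SecretAccessKey)
TOKEN=$(echo "$CREDS" | jq -r .Token)
hadoop distcp -Dfs.s3a.aws.credentials.provider="org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider" -Dfs.s3a.access.key="$ACCESS_KEY" -Dfs.s3a.secret.key="$SECRET_KEY" -Dfs.s3a.session.token="$TOKEN" s3a://bucket/prefix/file /path/on/hdfs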

Recent (2.8+) versions let you hide your credentials in a JCEKS file; there's some documentation on the Hadoop S3A page. That way there is no need to put any secrets on the command line at all; you just share the file across the cluster and then, in the distcp command, set hadoop.security.credential.provider.path to its path, like jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
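A minimal sketch of what the distcp invocation then looks like (re-using the example jceks path from above; adjust it to wherever you actually stored the file):
hadoop distcp \
-D hadoop.security.credential.provider.path=jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks \
/path/in/hdfs \
s3a://BUCKET_NAME/path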
Fan: if you are running in EC2, the IAM role credentials should be automatically picked up from the default chain of credential providers: after looking for the config options and environment variables, it tries a GET of the EC2 HTTP endpoint which serves up the session credentials. If that's not happening, make sure that com.amazonaws.auth.InstanceProfileCredentialsProvider is on the list of credential providers. It's a bit slower than the others (and can get throttled), so it's best to put it near the end.
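For illustration, a provider chain like the following (set in core-site.xml or passed with -D; the exact list here is an assumption, not a required value) keeps the instance-profile provider last:
-Dfs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,com.amazonaws.auth.InstanceProfileCredentialsProvider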

Amazon allows you to generate temporary credentials that you can retrieve from http://169.254.169.254/latest/meta-data/iam/security-credentials/
Quoting the docs:
An application on the instance retrieves the security credentials provided by the role from the instance metadata item iam/security-credentials/role-name. The application is granted the permissions for the actions and resources that you've defined for the role through the security credentials associated with the role. These security credentials are temporary and we rotate them automatically. We make new credentials available at least five minutes prior to the expiration of the old credentials.
The following command retrieves the security credentials for an IAM role named s3access.
$ curl http://169.254.169.254/latest/meta-data/iam/security-credentials/s3access
The following is example output.
{
"Code" : "Success",
"LastUpdated" : "2012-04-26T16:39:16Z",
"Type" : "AWS-HMAC",
"AccessKeyId" : "AKIAIOSFODNN7EXAMPLE",
"SecretAccessKey" : "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
"Token" : "token",
"Expiration" : "2012-04-27T22:39:16Z"
}
For applications, AWS CLI, and Tools for Windows PowerShell commands that run on the instance, you do not have to explicitly get the temporary security credentials — the AWS SDKs, AWS CLI, and Tools for Windows PowerShell automatically get the credentials from the EC2 instance metadata service and use them. To make a call outside of the instance using temporary security credentials (for example, to test IAM policies), you must provide the access key, secret key, and the session token. For more information, see Using Temporary Security Credentials to Request Access to AWS Resources in the IAM User Guide.

If you do not want to use access and secret keys (or show them in your scripts), and if your EC2 instance has access to S3, then you can use the instance credentials:
hadoop distcp \
-Dfs.s3a.aws.credentials.provider="com.amazonaws.auth.InstanceProfileCredentialsProvider" \
/hdfs_folder/myfolder \
s3a://bucket/myfolder

Not sure if it is because of a version difference, but to use "secrets from credential providers" the -Dfs flag would not work for me; I had to use the -D flag, as shown in the Hadoop 3.1.3 "Using_secrets_from_credential_providers" docs.
First I saved my AWS S3 credentials in a Java Cryptography Extension KeyStore (JCEKS) file.
hadoop credential create fs.s3a.access.key \
-provider jceks://hdfs/user/$USER/s3.jceks \
-value <my_AWS_ACCESS_KEY>
hadoop credential create fs.s3a.secret.key \
-provider jceks://hdfs/user/$USER/s3.jceks \
-value <my_AWS_SECRET_KEY>
Then the following distcp command format worked for me.
hadoop distcp \
-D hadoop.security.credential.provider.path=jceks://hdfs/user/$USER/s3.jceks \
/hdfs_folder/myfolder \
s3a://bucket/myfolder
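To double-check what ended up in the keystore (this lists only the alias names, not the secret values), something like this should work:
hadoop credential list -provider jceks://hdfs/user/$USER/s3.jceks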

Related

Kerberos HTTP service Using GSS shows No valid credentials due to domain name or host name mismatch

I have a microservice platform with multiple microservices connected to each other; the platform uses Kerberos for authentication of the microservices. On one of the microservice nodes, Hadoop is installed, and it uses a separate KDC for Hadoop cluster authentication.
Let's say the platform domain is "idm.com" and the Hadoop domain is "hadoop.com".
The Resource Manager is running on one node. I have configured the HTTP principal for SPNEGO in core-site.xml by setting the "hadoop.http.authentication.kerberos.principal" property to "HTTP/master.hadoop.com@HADOOP.COM", and the node's hostname is "hadoopmaster.idm.com".
I do kinit and acquire a root user ticket from the TGS. When I try curl using "curl -k -v --negotiate -u : https://master.hadoop.com:8090/cluster", it shows GSSException: No valid credentials provided.
If I run klist it shows two tickets: one krbtgt and a second one, "HTTP/hadoopmaster.idm.com@HADOOP.COM" (I have added this principal to the KDC database). The first, krbtgt, I got using kinit; the second, HTTP one, I got automatically after running curl (before curl the ticket was not there). The Kerberos client acquired another ticket for using the HTTP service.
After some debugging I noticed the problem/behaviour: I get a ticket for HTTP/hadoopmaster.idm.com@HADOOP.COM, whereas I have configured Hadoop to use HTTP/master.hadoop.com@HADOOP.COM. If I configure Hadoop to use "HTTP/hadoopmaster.idm.com@HADOOP.COM", then the UI is accessible.
I have added both FQDNs to the /etc/hosts file.
It seems that whichever FQDN I curl, I get the HTTP ticket for the first entry in the /etc/hosts file.
For example, if
...
10.7.0.5 hadoopmaster.idm.com
10.7.0.5 master.hadoop.com
...
now if I do curl I will get HTTP/hadoopmaster.idm.com@HADOOP.COM in klist,
and if /etc/hosts looks like this
...
10.7.0.5 master.hadoop.com
10.7.0.5 hadoopmaster.idm.com
...
now if I do curl I will get HTTP/master.hadoop.com in klist.
In both cases, if I configure the Hadoop property to the same principal I got using curl, then the UI is accessible; otherwise it shows a 403 GSSException, which I guess means curl used SPNEGO but didn't get valid credentials.
If it matches Hadoop's configured principal, then it works.
It looks like the hostname is causing the problem. Is there any way to map this hostname, or any Kerberos config which can map it, or any property which will give me a ticket for the exact hostname I specified in curl, regardless of the Hadoop configuration?
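For what it's worth, the behaviour described above usually comes from the Kerberos client canonicalizing the hostname through DNS/reverse DNS before building the service principal. A minimal krb5.conf sketch of the client-side settings that control this (these are standard MIT Kerberos options; whether they fit this particular setup is an assumption):
[libdefaults]
    # do not rewrite the hostname via (reverse) DNS when building HTTP/<host> principals
    dns_canonicalize_hostname = false
    rdns = false
[domain_realm]
    # map both names explicitly to the Hadoop realm
    .hadoop.com = HADOOP.COM
    master.hadoop.com = HADOOP.COM
    hadoopmaster.idm.com = HADOOP.COM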

How do I upload a file to Azurite from terminal?

I'm using Azurite and wish to create a container/upload a blob etc from the bash terminal!
I've tried using the Azure CLI like this:
az storage container create --account-name devstoreaccount1 --account-key Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw== --name mycontainer
But of course it doesn't work and complains of an authentication failure! By the way, the correct account key and name are used in that example.
I believe it's not possible to talk to Azurite using the Azure CLI.
All I want to do is create a container and upload a file to it from the terminal.
Does anybody know if this is possible? Or will I have to use a Java client (for example) to do the job?
Thanks
According to my test, when we use the account key and account name with the Azure CLI to create a blob container, the CLI uses the https protocol to connect to Azurite. But by default, Azurite only supports the http protocol. For more details, please refer to here.
So I suggest you use a connection string to connect to Azurite with the Azure CLI; the connection string tells the Azure CLI to use the http protocol.
For example:
Create container
az storage container create -n test --connection-string "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;QueueEndpoint=http://127.0.0.1:10001/devstoreaccount1;"
Upload file
az storage blob upload -f D:\test.csv -c test -n test.csv --connection-string "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;QueueEndpoint=http://127.0.0.1:10001/devstoreaccount1;"
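As a quick optional check (assuming the same local Azurite blob endpoint), listing the container's blobs should now show the uploaded file:
az storage blob list -c test --output table --connection-string "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"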

Hashicorp-vault userpass authentication

I am trying to use HashiCorp Vault for storing secrets for service accounts' usernames and passwords. I am following this link https://www.vaultproject.io/docs/auth/userpass.html to create the username and password.
My question is that, as per this example, I am specifying the password "foo" when I curl this from an EC2 instance. We want to automate this, and the code will come from git:
curl \
--request POST \
--data '{"password": "foo"}' \
http://10.10.218.10:8200/v1/auth/userpass/login/mitchellh
Our policy is that we should NOT store any password in git. How do I run this curl and get authenticated to Vault without specifying the password for the user? Is this possible?
Why don't you want to use the aws auth method?
Also, if you are sure you want to use password authentication, I think you can do something like this (see the sketch after this list):
1. Generate the user/password in Vault, store user passwords in Vault, and set a policy allowing a specific EC2 instance to read a specific user's password (EC2 auth method);
2. On the EC2 instance, run consul-template, which will authenticate to Vault with an EC2 instance role;
3. That consul-template will generate the curl command with the specific username and password;
4. Use this command.
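If the aws auth method route is acceptable, a minimal sketch from the EC2 instance might look like the following. This assumes the aws auth method is already enabled and a Vault role named my-ec2-role exists; both the role name and the secret path are placeholders, not values from the question:
export VAULT_ADDR=http://10.10.218.10:8200
# authenticate with the instance's AWS identity instead of a stored password
vault login -method=aws role=my-ec2-role
# then read the service-account credentials from wherever they are stored in Vault
vault kv get secret/service-accounts/my-service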

Downloading existing key-pair (pem) file for ECS Instance Alibaba

I am working on a client's project and they have Magento installed on their ECS instance; in order to SSH into it I need the pem file that was generated when the key pair was set up. However, I am not able to get the pem file from them, and I am instead looking for a way to download the existing one. Is that even possible? Or do I need to create a new key pair?
I wrote an article about Alibaba SSH Keypairs. If the keypair has been lost, you can replace it if you have Alibaba Cloud credentials (AccessKey and AccessKeySecret). This link to my article goes into specific details.
Alibaba Cloud SSH & ECS KeyPairs
The following commands require that the Alibaba Cloud CLI (aliyuncli) is installed and set up. I would back up (snapshot) the system before making the following changes.
This command will create a new Keypair called "NewKeyPair"
aliyuncli ecs CreateKeyPair --RegionId us-west-1 --KeyPairName NewKeyPair
This command will replace the current keypair with NewKeyPair (Windows syntax).
aliyuncli ecs AttachKeyPair --InstanceIds "[\"i-abcdeftvgllm854abcde\"]" --KeyPairName NewKeyPair
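Once the new key pair is attached and in effect, connecting should look roughly like this, assuming the private key returned by CreateKeyPair was saved locally as NewKeyPair.pem (the filename is just a placeholder):
chmod 400 NewKeyPair.pem
ssh -i NewKeyPair.pem root@<instance-public-ip>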
No, you can't download the existing key. In order to connect to the server via SSH, you need the key that was generated when the key pair was created. You can ask your clients for the key.

How secure is Local Hadoop Installation without password?

I want to install Hadoop 2.6 in pseudo-distributed mode on my Mac, following the instructions in the blog http://zhongyaonan.com/hadoop-tutorial/setting-up-hadoop-2-6-on-mac-osx-yosemite.html
The blogger suggests executing the commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
to allow an SSH connection to localhost without a password. I don't know anything about SSH, so sorry for the very basic concerns that follow. Can anyone please tell me:
Is it secure to run these commands? Or am I granting some kind of public remote access to my PC? (I told you it was a very basic question.)
How can I undo the authorisation I previously granted with these commands?
First and foremost, no Hadoop installation is secure without Kerberos. That's not closely related to what you're doing by generating SSH keys, though.
In any case, SSH keys require you to have both a public and a private key. No one can access the cluster without the generated private key, and no one can access the cluster if their key isn't in the authorized_keys file.
To put it simply, the commands are only as secure as the computer you're running them on. For example, some bad actor could be remotely copying all generated SSH keys on the system.
These passwordless SSH keys are for the Hadoop services to communicate with each other within the cluster, and each process should be run with limited system access anyway, not elevated/root privileges.
You undo the operation by ultimately destroying the key, but you can prevent access by just removing the entry from the authorized_keys file, as sketched below.
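A minimal sketch of that undo, assuming the key pair is the id_dsa one generated by the blog's commands:
# remove the matching public-key entry from authorized_keys
grep -v -F -f ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys > ~/.ssh/authorized_keys.tmp && mv ~/.ssh/authorized_keys.tmp ~/.ssh/authorized_keys
# destroy the key pair itself so it cannot be reused
rm ~/.ssh/id_dsa ~/.ssh/id_dsa.pub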
