How to access an AWS public dataset on S3? - hadoop

I am trying to load public data with Pig from S3 using this URL:
s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/4gram/data
LOAD 's3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/4gram/data'
but it is asking for an access key and secret key. Should I move this data to one of my buckets, or am I missing something?

Public data sets are accessible only when you have an AWS account. The data sets are visible to everyone on AWS, so you still need to pass credentials - an access key and secret key in this case.
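For example (this is not from the original answer, just a hedged sketch), you can confirm the dataset is readable once keys are supplied by listing a few objects with boto3. The bucket and prefix come from the question; the key placeholders stand in for your own credentials:

# Sketch: list a few objects from the public n-grams dataset using explicit credentials.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY",        # placeholder
    aws_secret_access_key="YOUR_SECRET_KEY",    # placeholder
)
resp = s3.list_objects_v2(
    Bucket="datasets.elasticmapreduce",
    Prefix="ngrams/books/20090715/eng-us-all/4gram/data",
    MaxKeys=5,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

If this listing works, the same key pair should work for the Pig/Hadoop s3n configuration as well, so there should be no need to copy the data into your own bucket.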

Related

How to rotate IAM user access keys

I am trying to rotate the access keys and secret keys for all of my IAM users. Last time this was required I did it manually, but now I want to do it with a rule or some automation.
I went through some links and found this one:
https://github.com/miztiik/serverless-iam-key-sentry
I tried to use it, but I was not able to perform the activity; it kept giving me an error. Can anyone please help, or suggest a better way to do it?
As I am also new to AWS Lambda, I am not sure how my code can be tested.
There are different ways to implement a solution. One common way to automate this is to keep the IAM user access keys in AWS Secrets Manager so they are stored safely. Next, you could configure a monthly or 90-day check that rotates the keys, using the AWS CLI or an SDK of your choice, and stores the new keys back in Secrets Manager.
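A minimal sketch of that idea in Python with boto3 (rather than the CLI), assuming one secret per user already exists in Secrets Manager; the secret naming scheme here is illustrative:

import json
import boto3

iam = boto3.client("iam")
secrets = boto3.client("secretsmanager")

def rotate_user_key(user_name):
    # Create a fresh access key for the user (IAM allows at most two keys per user).
    new_key = iam.create_access_key(UserName=user_name)["AccessKey"]

    # Store the new key pair in Secrets Manager.
    secrets.put_secret_value(
        SecretId="iam-keys/" + user_name,   # assumed secret naming scheme
        SecretString=json.dumps({
            "AccessKeyId": new_key["AccessKeyId"],
            "SecretAccessKey": new_key["SecretAccessKey"],
        }),
    )

    # Deactivate the old key(s) once the new one has been distributed.
    for key in iam.list_access_keys(UserName=user_name)["AccessKeyMetadata"]:
        if key["AccessKeyId"] != new_key["AccessKeyId"]:
            iam.update_access_key(
                UserName=user_name,
                AccessKeyId=key["AccessKeyId"],
                Status="Inactive",
            )

Running this from a Lambda function on a scheduled rule gives you the periodic check; for a first test, you can invoke the function manually with a test event from the Lambda console.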

Determine S3 Bucket Region

I'm using the AWS Ruby SDK v2 to access various buckets across a couple of regions. Is it possible to determine the region of each bucket at runtime, before I access it, so I can avoid the error below, which I get if I configure the AWS S3 client with the wrong region?
The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.
I know I can shell out and use the command below and parse the response, but ideally, I want to stay within the Ruby SDK.
aws s3api get-bucket-location
I couldn't find any official documentation for this, but from the aws-sdk spec you should be able to use the following code to get the region:
client = Aws::S3::Client.new()
resp = client.get_bucket_location(bucket: bucket_name)
s3_region = resp.data.location_constraint
This calls the same API as aws s3api get-bucket-location.
For aws-sdk (2.6.5), it is:
client.get_bucket_location(bucket: bucket_name).location_constraint
But then, how do we get a list of buckets that belong to a specific region?
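One way (shown here in Python with boto3 rather than the Ruby SDK, and purely as a sketch) is to combine ListBuckets with GetBucketLocation and filter on the result:

import boto3

s3 = boto3.client("s3")
target_region = "eu-west-1"   # illustrative region

for bucket in s3.list_buckets()["Buckets"]:
    loc = s3.get_bucket_location(Bucket=bucket["Name"])["LocationConstraint"]
    region = loc or "us-east-1"   # us-east-1 is reported as an empty/None constraint
    if region == target_region:
        print(bucket["Name"])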

Is there a way to create nodes from CloudFormation that can SSH to each other without passwords?

I am creating an AWS CloudFormation template which sets up a set of nodes which must allow keyless SSH login amongst themselves, i.e. one controller must be able to log in to all slaves with its private key. The controller's private key is generated dynamically, so I cannot hard-code it into the User-Data of the template or pass it as a parameter to the template.
Is there a way in Cloud Formation templates to add the controller's public key to slave nodes' authorized keys files?
Is there some other way to use security groups or IAMS to do what is required?
You have to pass the public key of the master server to the slave nodes in the form of user-data. CloudFormation does support user-data; you may have to figure out the syntax for it.
In other words, consider it a simple bash script which copies the master server's public key to the slaves, and pass this bash script as user-data so that it gets executed the first time the instance is created.
You will find plenty of search results on the above.
I would approach this problem with IAM machine roles. You can grant specific machines certain AWS rights. IAM roles do not apply to SSH access, but to AWS API calls, like S3 bucket access or creating EC2 instances.
Therefore, a solution might look like:
Create a controller machine role which can write to a particular S3 bucket.
Create a slave machine role which can read from that bucket.
Have the controller create and upload a public key into the bucket.
Since you don't know if the controller is created before the slaves, you'll have to have cloud-init set up a cron job every couple of minutes that downloads the key from the bucket if it hasn't done so yet (a rough sketch of both sides follows below).
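A rough sketch of that flow with boto3, assuming both roles grant access to an exchange bucket; the bucket name, object key, and file paths are all placeholders:

import time
import boto3

BUCKET = "my-cluster-key-exchange"     # assumed bucket name
KEY = "controller/id_rsa.pub"          # assumed object key

s3 = boto3.client("s3")                # credentials come from the instance's IAM role

def controller_publish_key(pubkey_path="/root/.ssh/id_rsa.pub"):
    # Controller: upload the freshly generated public key.
    s3.upload_file(pubkey_path, BUCKET, KEY)

def slave_fetch_key(authorized_keys="/root/.ssh/authorized_keys"):
    # Slave: poll until the controller's key appears, then append it once.
    while True:
        try:
            body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode()
        except s3.exceptions.NoSuchKey:
            time.sleep(60)             # key not uploaded yet; retry later
            continue
        with open(authorized_keys, "a") as f:
            f.write(body.rstrip("\n") + "\n")
        return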

Is there any way to get the user data of a running EC2 instance via the AWS SDK?

I've tried to use DescribeInstances, but I've found that the response does not contain the user data. Is there any way to retrieve this user data?
My use case is that I'm trying to request spot instances and assign different user data to each EC2 instance for some kind of automation, and then I want to tag the name of each instance according to this user data. Based on my understanding, a create-tags request requires an InstanceId, which is not available at the time when I make the request for a spot instance.
So I'm wondering whether there is any way to get the user data of a running instance instead of SSHing into the instance...
The DescribeInstanceAttribute API will provide you with the user data.
http://docs.aws.amazon.com/AWSEC2/latest/APIReference/ApiReference-query-DescribeInstanceAttribute.html
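For example, a minimal sketch with boto3 (the instance id here is a placeholder): fetch the user data through DescribeInstanceAttribute, decode it, and use it to tag the instance's Name.

import base64
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"     # placeholder

# The userData attribute comes back base64-encoded.
attr = ec2.describe_instance_attribute(InstanceId=instance_id, Attribute="userData")
user_data = base64.b64decode(attr["UserData"].get("Value", "")).decode()

# Tag the instance Name from the user data (tag values are limited to 255 characters).
ec2.create_tags(
    Resources=[instance_id],
    Tags=[{"Key": "Name", "Value": user_data.strip()[:255]}],
)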

What is a good way to access external data from AWS?

I would like to access external data from my AWS EC2 instance.
In more detail: I would like to specify inside my user-data the name of a folder containing about 2 MB of binary data. When my AWS instance starts up, I would like it to download the files in that folder and copy them to a specific location on the local disk. I only need to access the data once, at startup.
I don't want to store the data in S3 because, as I understand it, this would require storing my AWS credentials on the instance itself, or passing them as user-data, which is also a security risk. Please correct me if I am wrong here.
I am looking for a solution that is both secure and highly reliable.
Which operating system do you run?
You can use Elastic Block Store: it's like a device you can mount at boot (without credentials), and you get permanent storage there.
You can also sync up instances using something like the Gluster filesystem. See this thread on it.
