Run Apache Flink with Amazon S3 - hadoop

Has anyone succeeded in using Apache Flink 0.9 to process data stored on AWS S3? I found that Flink uses its own S3FileSystem instead of the one from Hadoop... and it doesn't seem to work.
When I use the path s3://bucket.s3.amazonaws.com/folder,
it fails with the following exception:
java.io.IOException: Cannot establish connection to Amazon S3:
com.amazonaws.services.s3.model.AmazonS3Exception: The request
signature we calculated does not match the signature you provided.
Check your key and signing method. (Service: Amazon S3; Status Code:
403;

Update May 2016: The Flink documentation now has a page on how to use Flink with AWS.
The question has been asked on the Flink user mailing list as well and I've answered it over there: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Processing-S3-data-with-Apache-Flink-td3046.html
tl;dr:
Flink program
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class S3FileSystem {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment ee = ExecutionEnvironment.createLocalEnvironment();
        // Reads via Hadoop's NativeS3FileSystem (s3n scheme), configured in core-site.xml below.
        DataSet<String> myLines = ee.readTextFile("s3n://my-bucket-name/some-test-file.xml");
        myLines.print();
    }
}
Add the following to core-site.xml and make it available to Flink:
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>putKeyHere</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>putSecretHere</value>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
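As a sanity check (my addition, not part of the original answer), the same core-site.xml can be exercised with plain Hadoop code before involving Flink: if this listing works, the credentials and the s3n wiring are fine, and any remaining failure is on the Flink side. A minimal sketch using the standard Hadoop FileSystem API, with a hypothetical class name and the bucket name from the example above; it assumes hadoop-common and its s3n support are on the classpath:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3nConfigCheck {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml from the classpath, including the fs.s3n.* entries above.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("s3n://my-bucket-name/"), conf);
        // List the bucket root; a 403 here points at the keys or signing, not at Flink.
        for (FileStatus status : fs.listStatus(new Path("s3n://my-bucket-name/"))) {
            System.out.println(status.getPath());
        }
    }
}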

You can retrieve the artifacts from the S3 bucket that is specified in the output section of the CloudFormation template.
For example, once the Flink runtime is up and running, the taxi stream processor program can be submitted to it to start the real-time analysis of the trip events in the Amazon Kinesis stream.
$ aws s3 cp s3://«artifact-bucket»/artifacts/flink-taxi-stream-processor-1.0.jar .
$ flink run -p 8 flink-taxi-stream-processor-1.0.jar --region «AWS region» --stream «Kinesis stream name» --es-endpoint https://«Elasticsearch endpoint»
Both of the above commands use Amazon S3 as the source, so you have to specify the artifact name accordingly.
Note: you can follow the link below to build a real-time stream processing pipeline with Apache Flink on AWS using EMR and S3 buckets.
https://aws.amazon.com/blogs/big-data/build-a-real-time-stream-processing-pipeline-with-apache-flink-on-aws/

Related

Programmatically configuring S3 options in Flink

Apparently Flink 1.14.0 doesn't correctly translate S3 options when they are set programmatically. I'm creating a local environment like this to connect to a local MinIO instance:
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val flinkConf = new Configuration()
flinkConf.setString("s3.endpoint", "http://127.0.0.1:9000")
flinkConf.setString("s3.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(flinkConf)
Then StreamingFileSink fails with a huge stack trace, the most relevant part being: Caused by: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider InstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Failed to connect to service endpoint. This means that Hadoop tried to enumerate all of the credential providers instead of using the one set in the configuration. What am I doing wrong?
I've spent ages trying to figure out this one too. I could not find a way to set it programmatically, but adding the following to src/main/resources/core-site.xml in my Flink Java project root worked in the end:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>com.amazonaws.auth.profile.ProfileCredentialsProvider</value>
  </property>
</configuration>
Then I could use the AWS_PROFILE env var to select stored credentials.
This was for Flink with flink-s3-fs-hadoop 1.13.2.
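For completeness (my addition, not from the thread): in local setups the configuration passed to the environment is not necessarily handed to the file-system factories, so another approach that is sometimes suggested is to initialize Flink's file systems explicitly before creating the environment. A hedged Java sketch with a hypothetical class name and placeholder MinIO credentials, assuming flink-s3-fs-hadoop is available as a plugin or on the classpath:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.plugin.PluginUtils;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LocalS3Setup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder endpoint and credentials for a local MinIO instance.
        conf.setString("s3.endpoint", "http://127.0.0.1:9000");
        conf.setString("s3.path.style.access", "true");
        conf.setString("s3.access-key", "minio-access-key");
        conf.setString("s3.secret-key", "minio-secret-key");
        // Hand the s3.* options to the file-system factories before any s3:// path is resolved.
        FileSystem.initialize(conf, PluginUtils.createPluginManagerFromRootFolder(conf));
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);
        // ... build the job against env and call env.execute() here.
    }
}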

Transferring scripts from S3 to EMR master

I've managed to get data files distributed on EMR clusters, but I can't get the simple Python scripts copied over to the master instance to run the Hadoop job.
Using the AWS CLI (aws s3 cp s3://the_bucket/the_script.py .) returns
A client error (Forbidden) occurred when calling the HeadObject operation: Forbidden.
I tried starting EMR clusters from the console, checking the defaults in the IAM roles section.
I've set up the two IAM roles EMR_DefaultRole and EMR_EC2_DefaultRole, making sure they had all available S3 access permissions,
and I've made sure to run aws configure for both ec2-user and hadoop (confirming the right credentials were in ~/.aws/config).
I still get the error above. If the hadoop user can distcp the data from the same S3 bucket that holds my Python scripts, shouldn't the hadoop user be able to copy those scripts using aws s3? Isn't the same user (hadoop) accessing the same bucket? Thanks for any pointers.

YARN log aggregation on AWS EMR - UnsupportedFileSystemException

I am struggling to enable YARN log aggregation for my Amazon EMR cluster. I am following this documentation for the configuration:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html#emr-plan-debugging-logs-archive
Under the section titled: "To aggregate logs in Amazon S3 using the AWS CLI".
I've verified that the hadoop-config bootstrap action puts the following in yarn-site.xml
<property><name>yarn.log-aggregation-enable</name><value>true</value></property>
<property><name>yarn.log-aggregation.retain-seconds</name><value>-1</value></property>
<property><name>yarn.log-aggregation.retain-check-interval-seconds</name><value>3000</value></property>
<property><name>yarn.nodemanager.remote-app-log-dir</name><value>s3://mybucket/logs</value></property>
I can run a sample job (pi from hadoop-examples.jar) and see that it completed successfully on the ResourceManager's GUI.
It even creates a folder under s3://mybucket/logs named with the application ID. But the folder is empty, and if I run yarn logs -applicationId <applicationId>, I get a stack trace:
14/10/20 23:02:15 INFO client.RMProxy: Connecting to ResourceManager at /10.XXX.XXX.XXX:9022
Exception in thread "main" org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: s3
at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:154)
at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:242)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:333)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:330)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:330)
at org.apache.hadoop.fs.FileContext.getFSofPath(FileContext.java:322)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:85)
at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1388)
at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:112)
at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:137)
at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:199)
This doesn't make any sense to me; I can run hdfs dfs -ls s3://mybucket/ and it lists the contents just fine. The machines are getting credentials from AWS IAM roles, and I've tried adding fs.s3n.awsAccessKeyId and such to core-site.xml with no change in behavior.
Any advice is much appreciated.
Hadoop provides two fs interfaces - FileSystem and AbstractFileSystem. Most of the time, we work with FileSystem and use configuration options like fs.s3.impl to provide custom adapters.
yarn logs, however, uses the AbstractFileSystem interface.
If you can find an implementation of that for S3, you can specify it using fs.AbstractFileSystem.s3.impl.
See core-default.xml for examples of fs.AbstractFileSystem.hdfs.impl etc.
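To illustrate the answer (my addition, hedged): newer Hadoop versions that ship hadoop-aws include an AbstractFileSystem adapter for the s3a scheme, org.apache.hadoop.fs.s3a.S3A, which can be wired in exactly the way described above; whether an equivalent exists for EMR's plain s3 scheme depends on the EMRFS version. A minimal Java sketch with a hypothetical class name, using the FileContext API (the same interface yarn logs goes through) and the bucket name from the question:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

public class AbstractFsS3Check {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Maps the s3a scheme onto the AbstractFileSystem adapter shipped with hadoop-aws.
        conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A");
        FileContext fc = FileContext.getFileContext(new URI("s3a://mybucket/"), conf);
        // If this listing works, yarn logs should be able to read the aggregated logs too.
        for (FileStatus status : fc.util().listStatus(new Path("s3a://mybucket/logs"))) {
            System.out.println(status.getPath());
        }
    }
}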

Which core-site.xml do I add my AWS access keys to?

I want to run Spark code on EC2 against data stored in my S3 bucket. According to both the Spark EC2 documentation and the Amazon S3 documentation, I have to add my AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to the core-site.xml file. However, when I shell into my master EC2 node, I see several core-site.xml files.
$ find . -name core-site.xml
./mapreduce/conf/core-site.xml
./persistent-hdfs/share/hadoop/templates/conf/core-site.xml
./persistent-hdfs/src/packages/templates/conf/core-site.xml
./persistent-hdfs/src/contrib/test/core-site.xml
./persistent-hdfs/src/test/core-site.xml
./persistent-hdfs/src/c++/libhdfs/tests/conf/core-site.xml
./persistent-hdfs/conf/core-site.xml
./ephemeral-hdfs/share/hadoop/templates/conf/core-site.xml
./ephemeral-hdfs/src/packages/templates/conf/core-site.xml
./ephemeral-hdfs/src/contrib/test/core-site.xml
./ephemeral-hdfs/src/test/core-site.xml
./ephemeral-hdfs/src/c++/libhdfs/tests/conf/core-site.xml
./ephemeral-hdfs/conf/core-site.xml
./spark-ec2/templates/root/mapreduce/conf/core-site.xml
./spark-ec2/templates/root/persistent-hdfs/conf/core-site.xml
./spark-ec2/templates/root/ephemeral-hdfs/conf/core-site.xml
./spark-ec2/templates/root/spark/conf/core-site.xml
./spark/conf/core-site.xml
After some experimentation, I determined that I can access an s3n URL like s3n://mcneill-scratch/GR.txt from Spark only if I add my credentials to both mapreduce/conf/core-site.xml and spark/conf/core-site.xml.
This seems wrong to me. It's not DRY, and I can't find anything in the documentation that says you have to add your credentials to multiple files.
Is modifying multiple files the correct way to set S3 credentials via core-site.xml? Is there documentation somewhere that explains this?
./spark/conf/core-site.xml should be the right place
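As an alternative worth noting (my addition, not from the original answer), the credentials can also be set on the SparkContext's Hadoop configuration at runtime, which avoids editing several core-site.xml copies. A minimal Java sketch with a hypothetical class name, reusing the s3n property names and the bucket path from the question; the key values are placeholders, and the job is assumed to be launched via spark-submit:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkS3Credentials {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("s3n-read");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Same property names as the core-site.xml entries used elsewhere in this thread.
        sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "putKeyHere");
        sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "putSecretHere");
        // Print a few lines to confirm the credentials are picked up.
        sc.textFile("s3n://mcneill-scratch/GR.txt").take(5).forEach(System.out::println);
    }
}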

Unable to copy NCDC Data from Amazon AWS to Hadoop Cluster

I am trying to copy the NCDC data from Amazon S3 to my local Hadoop cluster using the following command.
hadoop distcp -Dfs.s3n.awsAccessKeyId='ABC' -Dfs.s3n.awsSecretAccessKey='XYZ' s3n://hadoopbook/ncdc/all input/ncdc/all
and I am getting the error given below:
java.lang.IllegalArgumentException: AWS Secret Access Key must be specified as the password of a s3n URL, or by setting the fs.s3n.awsSecretAccessKey property
I've gone through the following question, but it wasn't much help.
Problem with Copying Local Data
Any hint about how to solve the problem? A detailed answer would be much appreciated for better understanding. Thanks.
Have you tried this?
Excerpt from the Hadoop AmazonS3 wiki:
Here is an example copying a nutch segment named 0070206153839-1998 at
/user/nutch in hdfs to an S3 bucket named 'nutch' (Let the S3
AWS_ACCESS_KEY_ID be 123 and the S3 AWS_ACCESS_KEY_SECRET be 456):
% ${HADOOP_HOME}/bin/hadoop distcp
hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998
s3://123:456#nutch/
In your case, it should be something like this:
hadoop distcp s3n://ABC:XYZ#hadoopbook/ncdc/all hdfs://IPaddress:port/input/ncdc/all
You need to set the AWS access key ID and secret access key in core-site.xml:
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>xxxxxxx</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>xxxxxxxxx</value>
</property>
and restart your cluster.
