YARN log aggregation on AWS EMR - UnsupportedFileSystemException - hadoop

I am struggling to enable YARN log aggregation for my Amazon EMR cluster. I am following this documentation for the configuration:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html#emr-plan-debugging-logs-archive
Under the section titled: "To aggregate logs in Amazon S3 using the AWS CLI".
I've verified that the hadoop-config bootstrap action puts the following in yarn-site.xml
<property><name>yarn.log-aggregation-enable</name><value>true</value></property>
<property><name>yarn.log-aggregation.retain-seconds</name><value>-1</value></property>
<property><name>yarn.log-aggregation.retain-check-interval-seconds</name><value>3000</value></property>
<property><name>yarn.nodemanager.remote-app-log-dir</name><value>s3://mybucket/logs</value></property>
I can run a sample job (pi from hadoop-examples.jar) and see that it completed successfully on the ResourceManager's GUI.
It even creates a folder under s3://mybucket/logs named with the application id, but the folder is empty, and if I run yarn logs -applicationId <applicationId>, I get a stack trace:
14/10/20 23:02:15 INFO client.RMProxy: Connecting to ResourceManager at /10.XXX.XXX.XXX:9022
Exception in thread "main" org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: s3
at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:154)
at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:242)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:333)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:330)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:330)
at org.apache.hadoop.fs.FileContext.getFSofPath(FileContext.java:322)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:85)
at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1388)
at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:112)
at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:137)
at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:199)
This doesn't make any sense to me; I can run hdfs dfs -ls s3://mybucket/ and it lists the contents just fine. The machines get their credentials from AWS IAM roles, and I've tried adding fs.s3n.awsAccessKeyId and related keys to core-site.xml with no change in behavior.
Any advice is much appreciated.

Hadoop provides two fs interfaces - FileSystem and AbstractFileSystem. Most of the time, we work with FileSystem and use configuration options like fs.s3.impl to provide custom adapters.
yarn logs, however, uses the AbstractFileSystem interface.
If you can find an implementation of that for S3, you can specify it using fs.AbstractFileSystem.s3.impl.
See core-default.xml for examples of fs.AbstractFileSystem.hdfs.impl etc.
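For illustration, this is roughly what that wiring looks like in core-site.xml. The class shown, org.apache.hadoop.fs.s3a.S3A, is the AbstractFileSystem wrapper that ships with later Hadoop releases for the s3a scheme; whether an equivalent exists for the plain s3 scheme on this EMR AMI is exactly the open question, so treat this as a sketch rather than a drop-in fix:
<!-- Sketch only: binds an AbstractFileSystem implementation to a scheme so that
     FileContext-based tools such as "yarn logs" can resolve it. On EMR the s3
     scheme is backed by EMRFS, which only implements FileSystem. -->
<property>
  <name>fs.AbstractFileSystem.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3A</value>
</property>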

Related

Where does YARN application logs get stored in EMR before sending to S3

I have a requirement to write YARN application logs from EMR to a destination other than S3. Can you please help me find where application logs get saved on the EMR master instance?
If the application is submitted to EMR as a step, then the logs will reside in:
/var/log/hadoop/steps/<<step-id>>/<<log-file>>
Most logs for EMR can be found under the /var/log directory on the master node.
You could also use the YARN CLI to get the application logs and redirect the returned log stream to a file that you can then ship wherever you need:
yarn logs -applicationId <<application_id>> > application_log_file.log
YARN logs are found at /var/log/hadoop-yarn/, and YARN container logs are found at /var/log/hadoop-yarn/containers.
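As a sketch of that yarn logs approach (the application id, file names, and destination host are placeholders, not values from the question):
# Dump the aggregated logs for one application to a local file...
APP_ID=application_1234567890123_0001
yarn logs -applicationId "$APP_ID" > "/tmp/${APP_ID}.log"
# ...then ship the file to whatever destination you need instead of S3:
scp "/tmp/${APP_ID}.log" user@log-archive-host:/var/archive/yarn/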
Links:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html

Insufficient number of DataNodes reporting when creating dataproc cluster

I am getting an "Insufficient number of DataNodes reporting" error when creating a Dataproc cluster with gs:// as the default FS. Below is the command I am using to create the cluster:
gcloud dataproc clusters create cluster-538f --image-version 1.2 \
--bucket dataproc_bucket_test --subnet default --zone asia-south1-b \
--master-machine-type n1-standard-1 --master-boot-disk-size 500 \
--num-workers 2 --worker-machine-type n1-standard-1 --worker-boot-disk-size 500 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' --project delcure-firebase \
--properties 'core:fs.default.name=gs://dataproc_bucket_test/'
I checked and confirmed that the default folders are being created in the bucket I am using.
As Igor suggests, Dataproc does not support GCS as a default FS, and I also suggest unsetting this property. Note that the fs.default.name property can be passed to individual jobs and will work just fine, as sketched below.
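Here is a rough sketch of that per-job approach (the examples jar path is what a typical Dataproc image provides, and the region and wordcount arguments are placeholders inferred from the question):
# Leave HDFS as the cluster default FS, but point this one job at GCS:
gcloud dataproc jobs submit hadoop --cluster=cluster-538f --region=asia-south1 \
    --jar=file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    --properties=fs.default.name=gs://dataproc_bucket_test/ \
    -- wordcount gs://dataproc_bucket_test/input gs://dataproc_bucket_test/output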
The error arises when the file system is accessed (HdfsClientModule). So I think it is probable that Google Cloud Storage lacks a specific feature that Hadoop requires, and the cluster creation fails after some folders have been created.
As somebody else mentioned previously, it is better to give up on the idea of using GCS as the default FS and let HDFS do that work in Dataproc. Nonetheless, you can still take advantage of Cloud Storage for data persistence, reliability, and performance; remember that data in HDFS is removed when a cluster is shut down.
1.- From a Dataproc node you can access data through the hadoop command to move data in and out, for example:
hadoop fs -ls gs://CONFIGBUCKET/dir/file
hadoop distcp hdfs://OtherNameNode/dir/ gs://CONFIGBUCKET/dir/file
2.- To access data from Spark or any Hadoop application, just use the gs:// prefix to address your bucket.
Furthermore, if the Cloud Storage connector is installed on premises, it can help to move HDFS data into Cloud Storage and then access it from a Dataproc cluster.

Hadoop job submission using Apache Ignite Hadoop Accelerators

Disclaimer: I am new to both Hadoop and Apache Ignite. Sorry for the lengthy background info.
Setup:
I have installed and configured the Apache Ignite Hadoop Accelerator. start-all.sh brings up the services below. I can submit Hadoop jobs; they complete and I can see the results as expected. The start-all run uses the traditional core-site, hdfs-site, mapred-site, and yarn-site configuration files.
28336 NodeManager
28035 ResourceManager
27780 SecondaryNameNode
27429 NameNode
28552 Jps
27547 DataNode
I have also installed Apache Ignite 2.6.0. I am able to start Ignite nodes and connect to them using the web console. I was able to load the cache from MySQL and run SQL queries and Java programs against this cache.
For running Hadoop jobs using ignited Hadoop, I created a separate ignite-config directory, in which I have customized the core-site and mapred-site configurations as per the instructions on the Apache Ignite website.
Issue:
When I run a Hadoop job using the command:
hadoop --config ~/ignite-conf jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar wordcount input output1
I get the error below (note: the same job ran successfully against plain Hadoop, without Ignite):
java.io.IOException: Failed to get new job ID.
...
...
Caused by: class org.apache.ignite.internal.client.GridClientDisconnectedException: Latest topology update failed.
...
...
Caused by: class org.apache.ignite.internal.client.GridServerUnreachableException: Failed to connect to any of the servers in list: [/:13500]
...
...
It looks like an attempt was made to look up the job tracker (port 13500) and it could not be found. From the service list above, it's obvious that a job tracker is not running. However, the job ran just fine on non-ignited Hadoop over YARN.
Can you help please?
This is resolved in my case.
The "job tracker" here actually means the Apache Ignite memory cache service listening on port 11211, not the 13500 the client was trying to reach.
After pointing the job tracker address in mapred-site.xml at that port, the job ran!
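For reference, the relevant mapred-site.xml entries end up looking something like the following (this mirrors the Ignite Hadoop Accelerator documentation; localhost:11211 assumes the Ignite node runs on the same host with its default connector port):
<property>
  <name>mapreduce.framework.name</name>
  <value>ignite</value>
</property>
<property>
  <name>mapreduce.jobtracker.address</name>
  <value>localhost:11211</value>
</property>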

Transferring scripts from s3 to emr master

I've managed to get data files distributed on EMR clusters, but can't get the simple Python scripts copied over to the master instance to run the Hadoop job.
Using the AWS CLI (aws s3 cp s3://the_bucket/the_script.py .) returns:
A client error (Forbidden) occurred when calling the HeadObject operation: Forbidden.
I tried starting EMR clusters from the console, choosing the defaults in the IAM roles section;
I've set up the two IAM roles EMR_DefaultRole and EMR_EC2_DefaultRole, making sure they had all available S3 access permissions;
and I've made sure to run aws configure for both ec2-user and hadoop (confirming the right credentials were in ~/.aws/config).
I still get the error above. If the hadoop user can distcp the data from the same S3 bucket that holds my Python scripts, shouldn't the hadoop user be able to copy those scripts using aws s3 cp? Isn't the same user (hadoop) accessing the same bucket? Thanks for any pointers.
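One way to narrow this down from the master node is to check which identity each user's CLI actually resolves, since distcp goes through the EC2 instance profile while the aws CLI prefers anything configured in ~/.aws/config. A quick sketch (bucket and script names taken from the question):
# Show the identity each user's AWS CLI resolves to:
aws sts get-caller-identity
sudo -H -u hadoop aws sts get-caller-identity
# Then retry the copy as the hadoop user:
sudo -H -u hadoop aws s3 cp s3://the_bucket/the_script.py .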

How do you establish single node Hadoop instance on AWS using Apache Whirr?

I am attempting to run a single-node instance of Hadoop on Amazon Web Services using Apache Whirr. I set whirr.instance-templates equal to 1 jt+nn+dn+tt. The instance starts up fine. I am able to create directories, but when I try to put files, I get a "File could only be replicated to 0 nodes, instead of 1" error. When I do a hadoop fsck /, I get an "Exception in thread "main" java.net.ConnectException: Connection refused" error. Does anyone know what is wrong with my configuration?
In my experience, Whirr does not always start all services reliably. It sounds like the namenode started (the namenode is responsible for storing directory information) but the datanode did not (the datanode stores the data).
Try running
hadoop dfsadmin -report
to see if a datanode is available.
If not, it often helps to restart the cluster.
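A minimal sketch of that check-and-restart sequence, assuming the Hadoop 1.x layout that Whirr-era clusters use:
# See whether any datanode has registered with the namenode:
hadoop dfsadmin -report
# If none show up, restarting the datanode daemon on the instance often helps:
$HADOOP_HOME/bin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode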
