Where do YARN application logs get stored in EMR before being sent to S3 - hadoop

I have a requirement to write YARN application logs from EMR to a destination other than S3. Can you please help me find where application logs get saved on the EMR master instance?

If the application is submitted to EMR as a step, the logs will reside in:
/var/log/hadoop/steps/<<step-id>>/<<log-file>>
Most EMR logs can be found under the /var/log directory on the master node.
You could also use the YARN CLI to get the application logs and redirect the returned log stream to a file, which you can then do whatever you want with:
yarn logs -applicationId <<application_id>> > application_log_file.log
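If the log has to land somewhere other than S3, the redirect above can be wrapped in a small helper. This is only a sketch: the function name save_app_logs and the destination directory are hypothetical, and it assumes the yarn CLI is on the PATH and that the application's logs are already available to yarn logs.

```shell
# Sketch: dump the logs of one YARN application into a file under a
# destination directory of your choice (hypothetical helper).
save_app_logs() {
  local app_id="$1"     # e.g. application_1234567890123_0001
  local dest_dir="$2"   # any mounted path: local disk, NFS, ...

  if [ -z "$app_id" ] || [ -z "$dest_dir" ]; then
    echo "usage: save_app_logs <application_id> <dest_dir>" >&2
    return 2
  fi

  mkdir -p "$dest_dir" || return 1
  # Same command as above, with the output redirected into dest_dir.
  yarn logs -applicationId "$app_id" > "$dest_dir/$app_id.log"
}
```

Calling save_app_logs with an application id and, say, /mnt/exported-logs would leave a file named after the application id in that directory, ready to be shipped wherever you need it.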

YARN logs are found at /var/log/hadoop-yarn/, and YARN container logs are found at /var/log/hadoop-yarn/containers
Links:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html
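To move the container logs mentioned in this answer somewhere else, one option is to bundle them per application first. The helper below is a rough sketch, not an EMR feature: archive_container_logs is a hypothetical name, the container log directory is passed in as an argument, and it assumes the usual NodeManager layout of one subdirectory per application id on the node where the containers ran.

```shell
# Sketch: bundle the local container logs of one application into a tarball.
# Hypothetical helper; pass the container log directory named in the answer.
archive_container_logs() {
  local log_root="$1"   # container log directory on this node
  local app_id="$2"     # e.g. application_1234567890123_0001
  local dest_dir="$3"   # where the .tar.gz should be written

  if [ ! -d "$log_root/$app_id" ]; then
    echo "no logs for $app_id under $log_root" >&2
    return 1
  fi

  mkdir -p "$dest_dir" || return 1
  # -C keeps the paths inside the archive relative to the log root.
  tar -czf "$dest_dir/$app_id.tar.gz" -C "$log_root" "$app_id"
}
```

The resulting archive can then be copied to whatever destination you need with scp, rsync or similar.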

Related

Get list of executed job on Hadoop cluster after cluster reboot

I have a Hadoop cluster, version 2.7.4. For some reason, I have to restart my cluster. I need the job IDs of the jobs that were executed on the cluster before the reboot. The command mapred job -list provides details only for currently running or waiting jobs.
You can see a list of all jobs on the Yarn Resource Manager Web UI.
In your browser go to http://ResourceManagerIPAdress:8088/
The history is still listed on the Yarn cluster I am currently testing on, even though I restarted the services several times.
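If you prefer the command line to the web UI, the same list can usually be pulled with the yarn CLI. A minimal sketch, assuming the yarn command is available on the node; list_app_ids is a hypothetical helper name, and whether applications from before a restart are still listed depends on how the ResourceManager is configured to retain its state.

```shell
# Sketch: print the ids of all applications the ResourceManager knows about.
list_app_ids() {
  # -appStates ALL includes FINISHED, FAILED and KILLED applications,
  # not only the running or waiting ones.
  yarn application -list -appStates ALL 2>/dev/null |
    awk '$1 ~ /^application_/ { print $1 }'
}
```

The awk filter simply keeps the first column of every row that starts with application_, which skips the header lines of the CLI output.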

Getting "User [dr.who] is not authorized to view the logs for application <AppID>" while running a YARN application

I'm running a custom YARN application using Apache Twill in an HDP 2.5 cluster, but I'm not able to see my own container logs (syslog, stderr and stdout) when I go to my container web page.
Also, the login changes from my Kerberos user to "dr.who" when I navigate to this page.
But I can see the logs of MapReduce jobs. The Hadoop version is 2.7.3 and the cluster has YARN ACLs enabled.
I had this issue with the Hadoop UI. I found in the documentation that hadoop.http.staticuser.user is set to dr.who by default, and you need to set it yourself in the relevant configuration file (core-site.xml in my case).
This is a late answer, but I hope it is useful.
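Before editing core-site.xml it may help to check which value is currently in effect. The snippet below is a sketch under a few assumptions: check_static_user is a hypothetical helper, it relies on hdfs getconf being available, and it reads the configuration visible on the machine where you run it, which is not necessarily what the running daemons loaded.

```shell
# Sketch: report which user the Hadoop web UIs fall back to.
check_static_user() {
  local user
  user="$(hdfs getconf -confKey hadoop.http.staticuser.user 2>/dev/null)" || {
    echo "could not read the configuration" >&2
    return 2
  }
  if [ "$user" = "dr.who" ]; then
    echo "static web user is still the default (dr.who)"
    return 1
  fi
  echo "static web user is $user"
}
```

If it reports dr.who, the property has not been overridden in the configuration that this node sees.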

Transferring scripts from s3 to emr master

I've managed to get data files distributed on emr clusters, but can't get the simple python scripts copied over to the master instance to run the hadoop job.
Using aws cli (aws s3 cp s3://the_bucket/the_script.py .) returns
A client error (Forbidden) occurred when calling the HeadObject operation: Forbidden.
I tried starting EMR clusters from the console, checking default in the IAM roles section.
I've set up two IAM roles, EMR_DefaultRole and EMR_EC2_DefaultRole, making sure they had all S3 access permissions available,
and I've made sure to run aws configure for both ec2-user and hadoop (confirming the right creds were in ~/.aws/config).
I still get the error above. If the hadoop user can distcp the data from the same S3 bucket that holds my Python scripts, shouldn't the hadoop user be able to copy those scripts using aws s3? Isn't the same user (hadoop) accessing the same bucket? Thanks for any pointers.

YARN log aggregation on AWS EMR - UnsupportedFileSystemException

I am struggling to enable YARN log aggregation for my Amazon EMR cluster. I am following this documentation for the configuration:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html#emr-plan-debugging-logs-archive
Under the section titled: "To aggregate logs in Amazon S3 using the AWS CLI".
I've verified that the hadoop-config bootstrap action puts the following in yarn-site.xml
<property><name>yarn.log-aggregation-enable</name><value>true</value></property>
<property><name>yarn.log-aggregation.retain-seconds</name><value>-1</value></property>
<property><name>yarn.log-aggregation.retain-check-interval-seconds</name><value>3000</value></property>
<property><name>yarn.nodemanager.remote-app-log-dir</name><value>s3://mybucket/logs</value></property>
I can run a sample job (pi from hadoop-examples.jar) and see that it completed successfully on the ResourceManager's GUI.
It even creates a folder under s3://mybucket/logs named with the application id. But the folder is empty, and if I run yarn logs -applicationId <applicationId>, I get a stack trace:
14/10/20 23:02:15 INFO client.RMProxy: Connecting to ResourceManager at /10.XXX.XXX.XXX:9022
Exception in thread "main" org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: s3
at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:154)
at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:242)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:333)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:330)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:330)
at org.apache.hadoop.fs.FileContext.getFSofPath(FileContext.java:322)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:85)
at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1388)
at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:112)
at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:137)
at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:199)
Which doesn't make any sense to me; I can run hdfs dfs -ls s3://mybucket/ and it lists the contents just fine. The machines are getting credentials from AWS IAM roles. I've tried adding fs.s3n.awsAccessKeyId and such to core-site.xml, with no change in behavior.
Any advice is much appreciated.
Hadoop provides two fs interfaces - FileSystem and AbstractFileSystem. Most of the time, we work with FileSystem and use configuration options like fs.s3.impl to provide custom adapters.
yarn logs, however, uses the AbstractFileSystem interface.
If you can find an implementation of that for S3, you can specify it using fs.AbstractFileSystem.s3.impl.
See core-default.xml for examples of fs.AbstractFileSystem.hdfs.impl etc.
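To see whether your cluster has an implementation registered for each interface, you can compare the two configuration keys side by side. This is a sketch only: show_fs_impls is a hypothetical helper, and it assumes hdfs getconf is available and reads the same configuration files that yarn logs uses on that node.

```shell
# Sketch: show which classes back a URI scheme for both Hadoop fs interfaces.
show_fs_impls() {
  local scheme="$1"   # e.g. s3 or hdfs
  local key value
  for key in "fs.$scheme.impl" "fs.AbstractFileSystem.$scheme.impl"; do
    if value="$(hdfs getconf -confKey "$key" 2>/dev/null)" && [ -n "$value" ]; then
      echo "$key = $value"
    else
      echo "$key is not set"
    fi
  done
}
```

Running show_fs_impls s3 on a cluster in the state described in the question would be expected to show a class for fs.s3.impl but nothing for fs.AbstractFileSystem.s3.impl, which matches the exception.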

Storm logviewer page not found

I'm able to submit a topology job in the multi-tenant cluster. The job is running. However, the logviewer page is not available. Is there any way to solve this issue?
You need to start the logviewer before you click on the topology port to see the logs.
To start the logviewer, run:
$ storm logviewer
the same way you run $ storm list
I faced the same issue for logviewer's home page, but directly navigating to a particular log file that exists in the logs folder works. Try this:
MachineIP:8000/log?file=worker-6700.log
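If you need that direct link for several workers, it can be built from the host and the worker port. A small sketch, with assumptions: logviewer_url is a hypothetical helper, port 8000 and the worker-<port>.log naming are taken from the example above, and the file naming may differ between Storm versions.

```shell
# Sketch: build the direct logviewer URL for one worker log file.
logviewer_url() {
  local host="$1"             # machine running the logviewer
  local worker_port="$2"      # worker slot port, e.g. 6700
  local lv_port="${3:-8000}"  # logviewer port; 8000 as in the example above
  echo "http://$host:$lv_port/log?file=worker-$worker_port.log"
}
```

For example, logviewer_url MachineIP 6700 prints the same address as the one given above, with the http:// scheme added.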
