Amazon EMR Application Master web UI?

I have started running Pig jobs on Amazon EMR using Hadoop YARN (AMI 3.3.1). Since there is no longer a JobTracker in YARN, I can't find a web UI that tracks the number of mappers and reducers for a MapReduce job. When I try to access the Application Master link provided on the ResourceManager UI page, I am told that the page doesn't exist (picture provided below).
Does anyone know how I can access a UI through my web browser that shows the current job status: the number of mappers and reducers, the % completed for each, etc.?
Thanks

Once you click the ApplicationMaster link on the ResourceManager web page, you'll be redirected to the ApplicationMaster web UI. EMR runs on EC2 instances, and each EC2 instance has two IP addresses associated with it: one used for private communication and one for public. EMR uses the private IP addresses (private DNS) to set up Hadoop, so you'll be redirected to a URL like this:
http://10.204.137.136:9046/proxy/application_1423027388806_0003/
which, as you can see, points to the instance's private IP address, so your browser cannot reach it. You just have to replace the private IP address with the public IP address (or public DNS name) of that instance.
Obtaining the public IP address of an instance
Using the EC2 web interface:
You can log in to the AWS EC2 console and find the instance's IP addresses there.
Using the command line:
If you are logged in to the instance and want to know its public IP address, issue the following command, which returns the public IP address of that instance:
curl http://169.254.169.254/latest/meta-data/public-ipv4
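Building on that, a rough sketch (assuming the standard instance metadata endpoints) of reconstructing the proxy URL from the master node itself:
# Look up this instance's public hostname via the EC2 metadata service,
# then rebuild the proxy URL from the example above with it.
PUBLIC_DNS=$(curl -s http://169.254.169.254/latest/meta-data/public-hostname)
echo "http://${PUBLIC_DNS}:9046/proxy/application_1423027388806_0003/"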
Also take a look at this AWS documentation page on how to view web interfaces; it covers other options such as setting up SSH tunneling and using a SOCKS proxy.

Related

Time Service on Cassandra node in Private VPC Segment

I've installed 5 nodes on a private segment of an Amazon VPC. I'm receiving the following error when the nodes start:
These notices occurred during the startup of this instance:
[ERROR] 09/23/15-13:48:03 sudo ntpdate pool.ntp.org:
[WARN] publichostname not available as metadata
[WARN] publichostname not available as metadata
I was able to reach out (through our NAT server) on port 80 to perform updates and log in to DataStax. We're not currently using any expiration times in the schemas. I set the machines up without a public hostname, since they are only accessible through an API or by those of us on the VPN. All of the nodes are in the same availability zone, but eventually we will want to have nodes in a different zone in the same region.
My questions are:
Is this a critical error?
Should I have a public hostname on the nodes?
Should they be on a public subnet (I would think not, for security purposes)?
Thanks in advance.
I found this:
https://github.com/riptano/ComboAMI/blob/2.5/ds2_configure.py#L136-L147
It seems to be the source of this message, and if that's the case, it seems harmless: a lookup of the instance's IP address is used instead of the hostname.
If you aren't familiar with it, http://169.254.169.254/ (which you will see in the code) is a web server inside the EC2 infrastructure that provides an easy way to access metadata about the instance. The metadata is specific to the instance making the request, and the IP address doesn't change.
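For illustration, a minimal sketch of that fallback logic (the metadata paths are the standard EC2 ones; the variable name is mine):
# Try the public hostname first; if the instance has none (as on a
# private VPC subnet), fall back to the local (private) IPv4 address.
ADDR=$(curl -sf http://169.254.169.254/latest/meta-data/public-hostname) \
  || ADDR=$(curl -sf http://169.254.169.254/latest/meta-data/local-ipv4)
echo "Using: ${ADDR}"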

Access hadoop nodes web UI from multiple links

I am using the following setting for Hadoop's node web UI access:
dfs.namenode.http-address : 127.0.0.1:50070
With this, I am able to access the node web UI only from the local machine, as:
http://127.0.0.1:50070
Is there any way I can make it accessible from outside as well? Say, like:
http://<Machine-IP>:50070
Thanks in advance!
You can use the hostname or IP address instead of localhost/127.0.0.1.
Make sure you can ping the hostname or IP from the remote machine. If you can ping it, then you should be able to access the web UI.
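For example, in hdfs-site.xml you could bind the NameNode web UI to a specific address, or to all interfaces (a sketch; <Machine-IP> is a placeholder, and 0.0.0.0 exposes the UI on every interface, so restrict access with firewall rules):
dfs.namenode.http-address : <Machine-IP>:50070
or, to listen on all interfaces:
dfs.namenode.http-address : 0.0.0.0:50070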
To ping it:
Open cmd/terminal on the remote machine and type:
ping <hostname-or-ip>
From http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html
The following table lists web interfaces that you can view on the core
and task nodes. These Hadoop interfaces are available on all clusters.
To access the following interfaces, replace slave-public-dns-name in
the URI with the public DNS name of the node. For more information
about retrieving the public DNS name of a core or task node instance,
see Connecting to Your Linux/Unix Instances Using SSH in the Amazon
EC2 User Guide for Linux Instances. In addition to retrieving the
public DNS name of the core or task node, you must also edit the
ElasticMapReduce-slave security group to allow SSH access over TCP
port 22. For more information about modifying security group rules,
see Adding Rules to a Security Group in the Amazon EC2 User Guide for
Linux Instances.
YARN ResourceManager
YARN NodeManager
Hadoop HDFS NameNode
Hadoop HDFS DataNode
Spark HistoryServer
Because there are several application-specific interfaces available on
the master node that are not available on the core and task nodes, the
instructions in this document are specific to the Amazon EMR master
node. Accessing the web interfaces on the core and task nodes can be
done in the same manner as you would access the web interfaces on the
master node.
There are several ways you can access the web interfaces on the master
node. The easiest and quickest method is to use SSH to connect to the
master node and use the text-based browser, Lynx, to view the web
sites in your SSH client. However, Lynx is a text-based browser with a
limited user interface that cannot display graphics. The following
example shows how to open the Hadoop ResourceManager interface using
Lynx (Lynx URLs are also provided when you log into the master node
using SSH).
lynx http://ip-###-##-##-###.us-west-2.compute.internal:8088/
There are two remaining options for accessing web interfaces on the
master node that provide full browser functionality. Choose one of the
following:
Option 1 (recommended for more technical users): Use an SSH client to connect to the master node, configure SSH tunneling with local port
forwarding, and use an Internet browser to open web interfaces hosted
on the master node. This method allows you to configure web interface
access without using a SOCKS proxy.
To do this, use the command:
$ ssh -gnNT -L 9002:localhost:8088 user@example.com
where user@example.com is your username and host. Note the use of -g to open access to external IP addresses (beware: this is a security risk).
You can check it is running using:
nmap localhost
To close this SSH tunnel when done, use:
ps aux | grep 9002
to find the PID of your running ssh process and kill it.
Option 2 (recommended for new users): Use an SSH client to connect to the master node, configure SSH tunneling with dynamic port
forwarding, and configure your Internet browser to use an add-on such
as FoxyProxy or SwitchySharp to manage your SOCKS proxy settings. This
method allows you to automatically filter URLs based on text patterns
and to limit the proxy settings to domains that match the form of the
master node's DNS name. The browser add-on automatically handles
turning the proxy on and off when you switch between viewing websites
hosted on the master node, and those on the Internet. For more
information about how to configure FoxyProxy for Firefox and Google
Chrome, see Option 2, Part 2: Configure Proxy Settings to View
Websites Hosted on the Master Node.
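As a sketch of Option 2 (the key path and local SOCKS port 8157 are placeholders; hadoop is the default SSH user on EMR):
# Open a local SOCKS proxy on port 8157, tunnelled through the master
# node; point FoxyProxy/SwitchySharp at localhost:8157.
$ ssh -i ~/mykey.pem -N -D 8157 hadoop@ec2-###-##-##-###.us-west-2.compute.amazonaws.com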
This seems like insanity to me, but I have been unable to find out how to configure access in core-site.xml to override the web interface for the ResourceManager, which by default is available at localhost:8088/; if Amazon thinks this is the way, then I tend to go along with it.

Hadoop 2.2 on EC2 Web interface: "Browse the filesystem" and Datanode Links broken

I have configured a single-node cluster on an Amazon EC2 instance (ubuntu-trusty-14.04-amd64-server-20140927 (ami-3d50120d)). Once I start the Hadoop cluster, I visit the NameNode web interface (http://ec2-xx-xx-xx-xx.us-west-2.compute.amazonaws.com:50070/dfshealth.jsp), which works fine. But the "Browse the filesystem" link is broken: it points to http://ip-xxx-xx-xx-xxx.us-west-2.compute.internal:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/&nnaddr=172.31.25.124:9000, which is this instance's private IP. The same occurs when visiting the DataNode links under "Live Nodes".
Somehow, these links are being resolved to the private IP address of my instance. If I replace the URL with the public DNS of my instance, these pages load correctly. Has anyone seen, and better yet solved, this issue?
Try using the fully qualified host names in Hadoop's configs. I think you need to change core-site.xml and hdfs-site.xml to use your public DNS names.
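For example (a sketch; substitute your instance's actual public DNS name, keeping the ports your cluster already uses):
fs.defaultFS : hdfs://ec2-xx-xx-xx-xx.us-west-2.compute.amazonaws.com:9000 (core-site.xml)
dfs.namenode.http-address : ec2-xx-xx-xx-xx.us-west-2.compute.amazonaws.com:50070 (hdfs-site.xml)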
Similar issue
Use a SOCKS proxy together with a proxy configuration tool. The instructions for EMR should work the same for an EC2 Hadoop deployment.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-proxy.html

Private networking necessary for Mesos and Marathon?

I am working through this tutorial: http://mesosphere.io/docs/getting-started/cloud-install/
Just learning on an Ubuntu instance on Digital Ocean, I let the master process bind to the public IP, and the Mesos and Marathon web interfaces became publicly accessible. No surprises there.
Do Mesos and Marathon rely on Zookeeper to create private IPs between instances? Could you skip using Zookeeper by manually setting up a private network between instances? Then the proper way to start the master and slave processes is to bind to the secondary, private IPs of each instance?
Digital Ocean can set up private IPs automatically, but this is kind of a learning exercise for me. I am aware of the broad rule that administrator access to a server shouldn't come through a public IP. Another way of phrasing this posting is, does private networking provide the security for Mesos and Marathon?
Only starting with one Ubuntu instance, running both master and slave, for now. Binding to the loopback address would fix this issue for just one machine, I realize.
ZooKeeper is used for a few different things for both Marathon and Mesos:
1. Leader election
2. Storing state
3. Resolving the Mesos masters
At the moment, you can't skip ZooKeeper entirely because of 2 and 3 (although later versions of Mesos have their own registry which keeps track of state). AFAIK, Mesos doesn't rely on ZooKeeper for creation of private IPs - it'll bind to whatever is available (but you can force this via the ip parameter). So, you won't be able to forgo ZooKeeper entirely with a private network.
Private networking will provide some security for Mesos and Marathon - assuming you firewall off their access to the external world.
A good (although not necessarily the best) solution for keeping the instances on a private network is to set up an OpenVPN (or similar) network to one of the masters. Then, launch each instance on its private IP and make sure you also set the hostname parameter to that IP. Connect to the Mesos/Marathon web consoles via their private IP over the VPN, and everything should resolve correctly.
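For example, a sketch of pinning a master to a private address (the IP, quorum, and ZooKeeper path are placeholders; ip and hostname are standard Mesos flags):
# Bind to the private IP and advertise that same address to the cluster.
mesos-master --ip=10.0.0.5 --hostname=10.0.0.5 \
  --zk=zk://10.0.0.5:2181/mesos --quorum=1 --work_dir=/var/lib/mesos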
Mesos and Marathon don't create private IPs between instances.
For that, I suggest you use tinc, or directly a tinc Docker image.
Using this, I was able to do the configuration you want in 5 minutes; it's easier to configure than OpenVPN, and each host can connect to every other host, so there is no need for a VPN server to route all the traffic.
Each node stores a private and a public key for connecting to each server of the private network.
You should set up a private network for using Mesos.
After that, you can add all the hosts to /etc/hosts with their private-network IPs.
You will then be able to point ZooKeeper at the private network:
zk://master-1:2181,master-2:2181,master-3:2181
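Concretely, the /etc/hosts entries on each node might look like this (the private IPs are placeholders):
10.0.0.1  master-1
10.0.0.2  master-2
10.0.0.3  master-3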
Then the proper way to start the master and slave processes is to bind to the secondary private IPs of each instance.

Can an Amazon EC2 Instance access another Instance by Private IP?

I have two separate instances in my test scenario:
Web Server Instance
Database Server Instance
So far, the only way I can get from the 1st to the 2nd instance is by having Elastic IPs configured and using the public DNS (or IP) reference. I can limit unwanted access by configuring the security group for the 2nd to accept port 1433 traffic only from the 1st.
It seems like instances within the same Amazon AWS zone should be able to talk to each other more efficiently than by first going out and then coming back in.
Is there a way to go directly from the 1st to the 2nd instance using just the private DNS (or IP)?
If you are using the Amazon public DNS name, Amazon makes sure that all internal traffic gets routed internally only, so there is no problem with using the public DNS names. Have a look at this question and this article for more details.
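You can check this yourself with a DNS lookup (a sketch; the hostname is a placeholder): from inside an instance in the same region, the public DNS name resolves to the private IP, while from outside it resolves to the public IP.
# Run from inside EC2: returns the private address (e.g. 172.31.x.x)
dig +short ec2-xx-xx-xx-xx.us-west-2.compute.amazonaws.com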
