Airflow conn_id with multiple servers - hadoop

I am using WebHDFSSensor, and for that we need to provide a namenode. However, the active namenode and standby namenode can swap, so I can't just point webhdfs_conn_id at the current active namenode's host. I need the connection to cover both hosts. I tried providing the host as an array, but that didn't work.
So my question is: suppose I need a connection named webhdfs_default that covers two hosts, w.x.y.z and a.b.c.d. How do I create that?

Related

How to configure ports for hostname and localhost?

I am running a browser on a single-node Hortonworks Hadoop cluster (HDP 2.3.4) on CentOS 6.7:
With both localhost:8000 and <hostname>:8000 I can access Hue. The same works for Ambari on port 8080.
However, several other ports I can only reach via the hostname. For example, with <hostname>:50070 I can access the namenode service, but with localhost:50070 I cannot establish a connection. So I assume localhost is blocked for those ports, while the hostname is not.
How can I set things up so that localhost and <hostname> have the same port configuration?
This likely indicates that the NameNode HTTP server socket is bound to a single network interface, but not the loopback interface. The NameNode HTTP server address is controlled by configuration property dfs.namenode.http-address in hdfs-site.xml. Typically this specifies a host name or IP address, and this maps to a single network interface. You can tell it to bind to all network interfaces by setting property dfs.namenode.http-bind-host to 0.0.0.0 (the wildcard address, matching all network interfaces). The NameNode must be restarted for this change to take effect.
There are similar properties for other Hadoop daemons. For example, YARN has a property named yarn.resourcemanager.bind-host for controlling how the ResourceManager binds to a network interface for its RPC server.
More details are in the Apache Hadoop documentation for hdfs-default.xml and yarn-default.xml. There is also full coverage of multi-homed deployments in HDFS Support for Multihomed Networks.
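As a sketch, the configuration changes described above might look like the following (property names are from the Hadoop documentation cited; the 0.0.0.0 wildcard value is the one suggested above, and the daemons must be restarted afterwards):

```xml
<!-- hdfs-site.xml: bind the NameNode HTTP server to all interfaces -->
<property>
  <name>dfs.namenode.http-bind-host</name>
  <value>0.0.0.0</value>
</property>

<!-- yarn-site.xml: analogous setting for the ResourceManager RPC server -->
<property>
  <name>yarn.resourcemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>
```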

Apache Spark error : Could not connect to akka.tcp://sparkMaster@

These are our first steps using big data tools like Apache Spark and Hadoop.
We have installed Cloudera CDH 5.3. From Cloudera Manager we chose to install Spark. Spark is up and running very well on one of the nodes in the cluster.
From my machine I made a little application that connects to read a text file stored on Hadoop HDFS.
I am trying to run the application from Eclipse and it displays these messages:
15/02/11 14:44:01 INFO client.AppClient$ClientActor: Connecting to master spark://10.62.82.21:7077...
15/02/11 14:44:02 WARN client.AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@10.62.82.21:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@10.62.82.21:7077
15/02/11 14:44:02 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@10.62.82.21:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: no further information: /10.62.82.21:7077
The application has one class that creates a context using the following line:
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("Spark Count").setMaster("spark://10.62.82.21:7077"));
where this IP is the IP of the machine Spark is running on.
Then I try to read a file from HDFS using the following line
sc.textFile("hdfs://10.62.82.21/tmp/words.txt")
When I run the application I get the errors shown above.
Check your Spark master logs, you should see something like:
15/02/11 13:37:14 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@mymaster:7077]
15/02/11 13:37:14 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkMaster@mymaster:7077]
15/02/11 13:37:14 INFO Master: Starting Spark master at spark://mymaster:7077
Then, when you're connecting to the master, be sure to use exactly the same hostname as found in the logs above (do not use the IP address):
.setMaster("spark://mymaster:7077"));
Spark standalone is a bit picky with this hostname/IP stuff.
When you create your Spark master using the shell command sbin/start-master.sh, go to the address http://localhost:8080 and check the "URL" row.
I notice no accepted answer, just for info I thought I'd mention a couple things.
First, in the spark-env.sh file in the conf directory, the SPARK_MASTER_IP and SPARK_LOCAL_IP settings can be hostnames. You don't want them to be, but they can be.
As noted in another answer, Spark can be a little picky about hostname vs. IP address, because of this resolved bug/feature: See bug here. The problem is, it's not clear whether the "resolution" was simply to tell us to use the IP instead of the hostname.
Well, I am having this same problem right now, and the first thing to do is check the basics.
Can you ping the box where the Spark master is running? Can you ping the worker from the master? More importantly, can you password-less ssh to the worker from the master box? Per the 1.5.2 docs you need to be able to do that with a private key AND have the worker entered in the conf/slaves file. I copied the relevant paragraph at the end.
You can get a situation where the worker can contact the master but the master can't get back to the worker so it looks like no connection is being made. Check both directions.
Finally, of all the combinations of settings, in a limited experiment just now I found only one that mattered: on the master, in spark-env.sh, set SPARK_MASTER_IP to the IP address, not the hostname. Then connect from the worker with spark://192.168.0.10:7077 and voila, it connects! Seemingly none of the other config parameters are needed here.
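As a sketch, that one setting is a single line in the master's spark-env.sh (the address 192.168.0.10 is the example used in this answer; substitute your master's actual IP):

```
# conf/spark-env.sh on the master: bind the master to an explicit IP
SPARK_MASTER_IP=192.168.0.10
```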
Here's the paragraph from the docs about ssh and slaves file in conf:
To launch a Spark standalone cluster with the launch scripts, you
should create a file called conf/slaves in your Spark directory, which
must contain the hostnames of all the machines where you intend to
start Spark workers, one per line. If conf/slaves does not exist, the
launch scripts defaults to a single machine (localhost), which is
useful for testing. Note, the master machine accesses each of the
worker machines via ssh. By default, ssh is run in parallel and
requires password-less (using a private key) access to be setup. If
you do not have a password-less setup, you can set the environment
variable SPARK_SSH_FOREGROUND and serially provide a password for each
worker.
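For reference, the conf/slaves file described in that paragraph is just a plain list of worker hostnames, one per line (the names below are placeholders):

```
worker1.example.com
worker2.example.com
```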
Once you have done that, using the IP address should work in your code. Let us know! This can be an annoying problem, and learning that most of the config params don't matter was nice.

Connecting to a Hadoop HDFS within a cloudera virtual machine via IP

I'm looking to connect to a hadoop instance on a cloudera virtual machine via the server address of the hdfs. Would anyone know how to find the IP address of this?
If not, how could I connect locally to the hdfs within virtual machine, as they are both running on the same computer.
Need to do this for a pentaho kettle connection.
If you're trying to configure Pentaho Data Integration (Kettle) to use HDFS as an input data source, then first you'll need to get the hostname/IP address and port number of the HDFS NameNode service, which you will then enter into (Pentaho) Spoon (the GUI to Kettle).
Getting HDFS NameNode IP/port number
The default port of the Hadoop HDFS NameNode service is 8020 in both CDH4 and 5 (source).
If for some reason you're not using the defaults, then the hostname/port of the HDFS NameNode service can be found in Cloudera Manager (which should be installed if you're using the Cloudera Quickstart VM, for example):
Click on the HDFS service on the main Cloudera Manager page
Click on Configuration - View and Edit
Click on NameNode - Ports and Addresses.
"NameNode Port" is the one you want, i.e. not "NameNode Web UI Port". As PDI needs the NameNode port.
Browse HDFS files in PDI to confirm
Test by opening Pentaho Data Integration (Spoon) and creating a "Hadoop Copy Files" transformation step, as an example, and then enter in your HDFS details in the "browse files" area and check if a directory list shows up.
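If you have access to the cluster's configuration files, the same host/port also appears as the fs.defaultFS property in core-site.xml. A minimal Python sketch of extracting it (the sample XML below is an illustration, not taken from any real cluster; quickstart.cloudera:8020 is a placeholder value):

```python
# Sketch: read the NameNode host/port from a core-site.xml snippet.
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Example core-site.xml content (placeholder values)
CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://quickstart.cloudera:8020</value>
  </property>
</configuration>"""

def namenode_address(core_site_xml):
    """Return (host, port) from the fs.defaultFS property."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "fs.defaultFS":
            uri = urlparse(prop.findtext("value"))
            return uri.hostname, uri.port
    raise ValueError("fs.defaultFS not found")

print(namenode_address(CORE_SITE))  # ('quickstart.cloudera', 8020)
```

Whatever host and port this yields is what goes into the Spoon connection dialog.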

Hadoop Dedoop Application unable to contact Hadoop Namenode : Getting "Unable to contact Namenode" error

I'm trying to use the Dedoop application that runs using Hadoop and HDFS on Amazon EC2. The Hadoop cluster is set up, and the Namenode, JobTracker and all other daemons are running without error.
But the Dedoop.war application is not able to connect to the Hadoop Namenode after deploying it on Tomcat.
I have also checked to see if the ports are open in EC2.
Any help is appreciated.
If you're using Amazon AWS, I highly recommend using Amazon Elastic Map Reduce. Amazon takes care of setting up and provisioning the Hadoop cluster for you, including things like setting up IP addresses, NameNode, etc.
If you're setting up your own cluster on EC2, you have to be careful with public/private IP addresses. Most likely, you are pointing to the external IP addresses - can you replace them with the internal IP addresses and see if that works?
Can you post some lines of the Stacktrace from Tomcat's log files?
Dedoop must establish a SOCKS proxy server (similar to ssh -D port username@host) to pass connections to the Hadoop nodes on EC2. This is mainly because Hadoop resolves public IPs to EC2-internal IPs, which breaks MR job submission and HDFS management.
To this end, Tomcat must be configured to establish ssh connections. The setup procedure is described here.

Configuring a slave's hostname using internal IP - Multiple NICs

In my Hadoop environment, I need to configure my slave nodes so that when they communicate in the middle of a map/reduce job they use the internal IP instead of the external IP that it picks up from the hostname.
Is there any way to set up my Hadoop config files to specify that the nodes should communicate using the internal IPs instead of the external IPs? I've already used the internal IPs in my core-site.xml, master, and slave files.
I've done some research and I've seen people mention the "slave.host.name" parameter, but which config file would I place this parameter in? Are there any other solutions to this problem?
Thanks!
The IP routing tables would have to be changed so that the network between the Hadoop nodes uses a particular gateway. I don't think Hadoop has any setting to change which gateway to use.
You can configure slave.host.name in mapred-site.xml for each slave node.
Also remember to use that host name (instead of IP) consistently for all other configurations (core-site.xml, hdfs-site.xml, mapred-site.xml, masters, slaves) and also /etc/hosts file.
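A sketch of that mapred-site.xml entry on a slave node (the hostname below is a placeholder; it should match the internal name you use consistently in /etc/hosts and the other config files):

```xml
<!-- mapred-site.xml on each slave node -->
<property>
  <name>slave.host.name</name>
  <value>slave1.internal.example.com</value>
</property>
```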
