Hadoop distcp: what ports are used?

If I want to use distCp on an on-prem hadoop cluster, so it can 'push' data to external cloud storage, what firewall considerations must be made in order to leverage this tool? What ports does the actual transfer of data take place on? Is it via SSH, and/or port 8020? I need to make sure network connectivity is provided for source to destination, but with the least amount of privileges ascribed to it. (i.e., only opening ports that are absolutely needed)

I do not believe SSH is used for the actual data transfer, other than you actually logging into the cluster and starting the command, for example.
At a minimum, it would be the RPC ports for the NameNodes and the data-transfer ports for the DataNodes, i.e. whatever you've configured for fs.defaultFS, dfs.namenode.rpc-address and dfs.datanode.address.
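As a rough illustration of what those properties map to, assuming stock defaults and placeholder hostnames (check your own core-site.xml and hdfs-site.xml for the real values):

fs.defaultFS=hdfs://namenode.example.com:8020
dfs.namenode.rpc-address=namenode.example.com:8020
dfs.datanode.address=0.0.0.0:9866

Here 8020 is the common NameNode RPC port (newer Apache defaults may use 9820), and 9866 is the Hadoop 3.x default DataNode data-transfer port; only the ports you actually find in those files need to be reachable from the nodes running the distcp map tasks.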

Related

NIFI secure 3 node cluster

I am seeing some errors in my NiFi cluster. I have a 3-node secured NiFi cluster, and I am seeing the errors below on 2 of the nodes:
ERROR [main] org.apache.nifi.web.server.JettyServer Unable to load flow due to:
java.io.IOException: org.apache.nifi.cluster.ConnectionException:
Failed to connect node to cluster due to: java.io.IOException:
Could not begin listening for incoming connections in order to load balance data across the cluster.
Please verify the values of the 'nifi.cluster.load.balance.port' and 'nifi.cluster.load.balance.host'
properties as well as the 'nifi.security.*' properties
See the clustering configuration guide for the list of clustering options you have to configure. For load balancing, you'll need to specify ports that are open in your firewall so that the nodes can communicate. You'll also need to make sure that each host has its node hostname property set, its host ports set, and that there are no firewall restrictions between the nodes and your Apache ZooKeeper cluster.
If you want to simplify the setup to play around, you can use the information in the clustering configuration section of the admin guide to set up an embedded ZooKeeper node within each NiFi instance. However, I would recommend setting up an external ZooKeeper cluster. A little more work, but ultimately worth it.
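As a minimal sketch of the properties involved (the hostnames, ZooKeeper connect string and protocol port below are placeholders; 6342 is just the stock default for the load-balance port), each node's nifi.properties would contain something along these lines:

nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-node1.example.com
nifi.cluster.node.protocol.port=11443
nifi.cluster.load.balance.host=nifi-node1.example.com
nifi.cluster.load.balance.port=6342
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181

Whatever ports you choose for the protocol and load-balance properties are the ones that need to be open between the nodes.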

What is the communication port between Namenode and Datanode in hadoop cluster

I want to know the communication protocol, specifically the port number, used by the NameNode and DataNodes in Hadoop.
Say, if I write the following command in Namenode,
hdfs dfsadmin -report
it will show the details of the live nodes (namenode & datanodes), how many datanodes there are, etc. My question is: how do the namenode and datanodes communicate, and via which port? I am actually getting only 1 datanode with the above command, whereas in my cluster there are 8 datanodes, so I am not sure whether some port blocking on the network is causing this. My firewall is disabled on the namenode and all the datanodes; I have checked this via the sudo ufw status command, which returned inactive.
From the official Hadoop documentation (link), I have found this:
The Communication Protocols
All HDFS communication protocols are layered on top of the TCP/IP
protocol. A client establishes a connection to a configurable TCP port
on the NameNode machine. It talks the ClientProtocol with the
NameNode. The DataNodes talk to the NameNode using the DataNode
Protocol. A Remote Procedure Call (RPC) abstraction wraps both the
Client Protocol and the DataNode Protocol. By design, the NameNode
never initiates any RPCs. Instead, it only responds to RPC requests
issued by DataNodes or clients.
I am using Hadoop 3.1.1 on Ubuntu 16.04.
Any help is highly appreciated. Thanks.
These are all configured in hdfs-site.xml.
For example, by default, dfs.datanode.address=0.0.0.0:9866
If you search for port or address, you can generally find what you are looking for at https://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
If that command or the NameNode UI don't show the datanodes, then SSH to the individual nodes, run jps to see whether the DataNode process is running, and check its log files to find out why if it is not.
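For example, a quick check on one of the worker nodes might look like this (the log path below is only the typical default under $HADOOP_HOME/logs and may differ in your installation):

jps | grep -i datanode
tail -n 100 $HADOOP_HOME/logs/hadoop-*-datanode-*.log

If the process is running but unreachable, also verify that the ports from hdfs-default.xml (e.g. 9866) are actually listening, for instance with ss -tlnp | grep 9866.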

How to simulate external TCP traffic in Windows?

I don't have any code yet, so please feel free to move this to a sister site, if you think it belongs there :)
I have a Program A (I don't have its source code, so I can't modify its behavior) running on my machine which keeps listening on a particular port of the system for TCP data. It's a peer-to-peer application.
System 1 running A ====================== System 2 running A
Program A is supposed to run on systems where I may not be allowed to modify firewall settings to allow incoming connections on the port the program listens on. I have an EC2 Linux server running Ubuntu 16.
So I thought I can use an existing tool or create a program that would connect to the server on port X, and fetch the data from the server, and locally throw that data to the port A is listening to.
System 1 running A ========= SERVER =========== System 2 running A
What kind of configuration should I have on the server? And is there any program I can use for this, or an idea of how to make one?
I did something similar to bypass firewalls and hotspots.
Check out https://github.com/yarrick/iodine; with a proper configuration you would be able to send/receive packets as DNS queries, which as far as I know is always allowed. I used my server to get normal internet access from any hotspot I found.
You would lose some speed and have higher latency, but you will have access.
Hope I helped.
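If you control the EC2 server, another relay sketch (a different technique from the DNS tunneling above, assuming A listens on port 5000, a placeholder, and that GatewayPorts is enabled in the server's sshd_config) is a plain SSH reverse tunnel:

ssh -N -R 0.0.0.0:5000:localhost:5000 user@ec2-server    # run on System 1, where A listens
# System 2 then points its peer connection at ec2-server:5000

The outbound SSH connection from System 1 is what carries the traffic, so no inbound firewall change is needed on System 1.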

Is it possible to configure map-reduce to use hostnames rather than IPs during data transfer?

Currently I am trying to migrate HDFS files between two different Hadoop clusters by using distcp.
The source cluster is isolated in a network; each machine has both an external and an internal IP, and the NameNode talks to the DataNodes through the internal IP addresses.
On the destination side, distcp always fails to fetch the data, because it always tries to connect to the source DataNodes using the internal IPs of the source side, which are never reachable.
org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader.
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.47.194.252:50010]
Is it possible to switch from IPs to hostnames in this case? Then I could map the source hostnames to their external IPs on the destination side.
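One possible approach, assuming a reasonably recent Hadoop version: HDFS has switches that make clients (and datanodes) connect by the hostname the NameNode reports rather than the IP, which you could combine with /etc/hosts or DNS entries on the destination side mapping the source hostnames to their external IPs. A sketch of the hdfs-site.xml settings on the side running the distcp job:

dfs.client.use.datanode.hostname=true
dfs.datanode.use.datanode.hostname=true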

Hadoop name node URL for WebHDFS

I have a clustered NameNode setup. The NameNodes are configured to be active and passive (standby).
When I make a WebHDFS call, the URL to be provided is
http://<namenode-host>:<port>/webhdfs/v1/<path>
Since I have 2 NameNodes available, I have 2 URLs available:
http://<namenode1-host>:<port>/webhdfs/v1/<path> - it's active now
http://<namenode2-host>:<port>/webhdfs/v1/<path> - it's passive now
My question is: the NameNodes can fail over at any time. What value do I provide in HOST? Should I give the service name? Is there a virtual IP that is normally configured in the HDP platform which takes care of the redirection?
Or should I place a load balancer or gateway in front of the NameNodes so that the failover is handled without any impact to the calling application?
It's a bug; it doesn't work in HA mode.
You have to explicitly put the active NN URL every time the NN changes its state.
https://hortonworks.jira.com/browse/BUG-30030
You will get an exception if you're talking to an inactive namenode.
See my answer here: Any command to get active namenode for nameservice in hadoop?
You must determine the active NameNode first, then issue your WebHDFS API request to it. Issuing WebHDFS API requests to a standby NameNode will result in an HTTP 403 error.
There is no automatic way to determine the active NameNode through WebHDFS yet. You can use the hdfs command-line client to query the configuration, or alternatively loop through the NameNodes, issue JMX API requests to the /jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus endpoint, and parse the output.
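As a rough sketch of those two options (nn1 here stands for whatever NameNode ID you have in dfs.ha.namenodes.<nameservice>, the hostnames are placeholders, and 9870 is only the Hadoop 3.x default NameNode HTTP port):

hdfs haadmin -getServiceState nn1
for host in nn1.example.com nn2.example.com; do
  curl -s "http://$host:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus" | grep '"State"'
done

The haadmin command prints active or standby directly; the JMX output contains a "State" field you can parse instead.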