Downloading Hadoop Data from other PC - hadoop

I have Hadoop v2.6 installed on one PC running Ubuntu 14.04. I have added lots of unstructured data to HDFS using the hadoop fs -put command.
Can someone tell me how to download this data from another PC that is not in the Hadoop cluster, using the web user interface provided by Hadoop?
I can access the data from the other PC by typing (the IP address of the HDFS server):(port number) into the browser's address bar,
like this: 192.168.x.x:50070
The problem is that I am not able to download the data; the browser gives the error "Webpage Not Available". I also tried other browsers, but still no luck.

Port 50070 is the default NameNode web UI port. You should try port 14000, which is the default HttpFS port. If that still doesn't work, try the example from the manual:
http://192.168.x.x:14000?user.name=babu&op=homedir
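If HttpFS is running, you can also pull a file with curl against its REST API. This is a minimal sketch, assuming HttpFS listens on the default port 14000 and that the path /user/babu/data.txt and the user name babu are placeholders for your own path and user:
curl -L "http://192.168.x.x:14000/webhdfs/v1/user/babu/data.txt?op=OPEN&user.name=babu" -o data.txt
The -L flag follows any redirect the service issues before streaming the file contents.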

Related

Why can't standalone slaves connect to master on separate Mac OS boxes?

I have two Macs (both OS X El Capitan) at home, both connected to the same Wi-Fi. I want to install a Spark cluster (with two workers) on these two computers.
Mac1 (192.168.1.2) is my master, running Spark 1.5.2; it is up and working well, and I can see the Spark UI at http://localhost:8080/ (it also shows spark://Mac1:7077).
I have also run one slave on this machine (Mac1), and I see it under Workers in the Spark UI.
Then I copied Spark to the second machine (Mac2), and I am trying to run another slave on Mac2 (192.168.2.9) with this command:
./sbin/start-slave.sh spark://Mac1:7077
But it does not work. Looking at the log, it shows:
Failed to connect to master Mac1:7077
Actor not found for: ActorSelection[Anchor(akka.tcp://sparkMaster@Mac1:7077/),Path(/User/Master)]
Networking-wise, I can SSH from Mac1 to Mac2 and vice versa, but I cannot telnet to Mac1:7077.
I would appreciate any help solving this problem.
tl;dr Use the -h option for ./sbin/start-master.sh, i.e. ./sbin/start-master.sh -h Mac1
Optionally, you could do ./sbin/start-slave.sh spark://192.168.1.2:7077 instead.
The reason is that binding to ports in Spark is very sensitive to which names and IPs are used. In your case, 192.168.1.2 != Mac1: they are different "names" to Spark. That is why SSH works (it uses the OS name resolver) while the connection fails at the Spark level, where the two "names" are not considered equal.
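Another option is to pin the master to a stable address in conf/spark-env.sh on the master machine. A minimal sketch, assuming Spark 1.5.x where the relevant variable is SPARK_MASTER_IP:
# conf/spark-env.sh on Mac1 (the master): bind the master to its LAN address
export SPARK_MASTER_IP=192.168.1.2
# restart the master, then point the worker at the same address from Mac2
./sbin/start-master.sh
./sbin/start-slave.sh spark://192.168.1.2:7077
Whatever name or IP the master binds to is exactly the string the worker must use in the spark:// URL.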
Likely a networking/firewall issue on the Mac.
Also, the error message you copy/pasted references port 7070. Is this the issue?
Using IP addresses in conf/slaves works somehow, but then I have to use the IP everywhere to address the cluster instead of the host name.
SPARK + Standalone Cluster: Cannot start worker from another machine

Connecting to a Hadoop HDFS within a cloudera virtual machine via IP

I'm looking to connect to a Hadoop instance on a Cloudera virtual machine via the server address of HDFS. Would anyone know how to find the IP address of this?
If not, how could I connect locally to the HDFS inside the virtual machine, since both are running on the same computer?
I need to do this for a Pentaho Kettle connection.
If you're trying to configure Pentaho Data Integration (Kettle) to use HDFS as an input data source, you first need the hostname/IP address and port number of the HDFS NameNode service, which you then enter into (Pentaho) Spoon (the GUI for Kettle).
Getting HDFS NameNode IP/port number
The default port of the Hadoop HDFS NameNode service is 8020 in both CDH4 and 5 (source).
If for some reason you're not using the defaults, then the hostname/port of the HDFS NameNode service can be found in Cloudera Manager (which should be installed if you're using the Cloudera Quickstart VM, for example):
Click on the HDFS service on the main Cloudera Manager page
Click on Configuration - View and Edit
Click on NameNode - Ports and Addresses.
"NameNode Port" is the one you want, i.e. not "NameNode Web UI Port". As PDI needs the NameNode port.
Browse HDFS files in PDI to confirm
To test, open Pentaho Data Integration (Spoon), create a "Hadoop Copy Files" transformation step as an example, enter your HDFS details in the "browse files" area, and check whether a directory listing shows up.

How can I connect apache Nutch 2.x to a remote HBase cluster?

I have two machines. One runs HBase 0.92.2 in pseudo-distributed mode, while the other runs the Nutch 2.x crawler. How can I configure these two machines so that the one with HBase 0.92.2 acts as the back-end storage and the one with Nutch 2.x acts as the crawler?
I finally did it. It was easy to do.
I am sharing my experience here; maybe it can help someone.
1- Change the hbase-site.xml configuration file for pseudo-distributed mode (a sketch of this file follows these steps).
2- MOST IMPORTANT THING: on the HBase machine, replace the localhost IP in /etc/hosts with your real network IP, like this:
10.11.22.189 master localhost
(the HBase machine's IP is 10.11.22.189)
(Note: if you don't change the HBase machine's localhost entry, the remote Nutch crawler won't be able to connect to it.)
3- Copy/symlink hbase-site.xml into $NUTCH_HOME/conf.
4- Start your crawler and see it working.
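For step 1, here is a minimal hbase-site.xml sketch; the IP and the rootdir path are placeholders from this setup and should be adjusted to your own machine:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!-- where HBase stores its data; an HDFS URL in pseudo-distributed mode -->
    <value>hdfs://10.11.22.189:8020/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <!-- the HBase machine's network IP, not localhost, so remote clients can find ZooKeeper -->
    <value>10.11.22.189</value>
  </property>
</configuration>
The same file is what gets copied into $NUTCH_HOME/conf in step 3, so the Nutch machine connects to the same ZooKeeper quorum.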

Upload file from my virtual machine to another virtual machine using hadoop hdfs

Can anyone please tell me how I can upload a .txt file from my local machine to HDFS on another virtual machine, addressed by IP?
Regards,
Baskar.V
You might find the WebHDFS REST API useful. I have used it to write content from my local FS to HDFS and it works fine. Being REST based, it should work just as well from the local FS of a remote machine, provided the two machines can reach each other.
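A minimal sketch of the two-step WebHDFS upload with curl, assuming the NameNode web port is the default 50070 and that the path /user/baskar/sample.txt and user.name=baskar are placeholders:
# Step 1: ask the NameNode where to write; the response is a 307 redirect to a DataNode
curl -i -X PUT "http://<namenode-ip>:50070/webhdfs/v1/user/baskar/sample.txt?op=CREATE&user.name=baskar"
# Step 2: PUT the local file to the Location URL returned by step 1
curl -i -X PUT -T sample.txt "<location-url-from-step-1>"
Note that the DataNode hostname in the redirect must be resolvable and reachable from the uploading machine, otherwise step 2 fails even though step 1 succeeds.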

Hadoop namenode web UI not opening in CDH4

I recently installed the CDH distribution of Cloudera to create a 2 node cluster. From the Cloudera Manager UI, all services are running fine.
All the command-line tools (Hive, etc.) are also working fine, and I am able to read and write data to HDFS.
However, only the NameNode (and DataNode) web UI is not opening. Checking with netstat -a | grep LISTEN, the processes are listening on the assigned ports, and there are no firewall rules blocking the connections (I already disabled iptables).
I initially thought it could be a DNS issue, but even the IP address is not working. Cloudera Manager, installed on the same machine on another port, opens fine.
Any pointers on how to debug this problem?
I faced the same issue.
First it was because the NameNode was in safe mode,
and then because of two IP addresses (I have two NICs configured on the CDH cluster: one for internal connectivity between the servers (10.0.0.1), and the other to reach the servers from the Internet (192.168.0.1)).
When I try to open the NameNode GUI from any server connected to the cluster on the 10.0.0.1 network, the GUI opens and works fine, but from any other machine connected to the servers via the 192.168.0.1 network it fails.
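Two quick checks from the shell; a minimal sketch, assuming the HDFS client is available on one of the cluster nodes (binding to 0.0.0.0 is just one way to make the UI reachable on both networks, adjust to your setup):
# Is the NameNode still in safe mode?
hdfs dfsadmin -safemode get
# To serve the web UI on both NICs, set the HTTP address to bind to all interfaces,
# e.g. in hdfs-site.xml or through Cloudera Manager:
#   dfs.namenode.http-address = 0.0.0.0:50070
After changing the binding, restart the NameNode for it to take effect.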
