How to send files from a local machine to a HortonBox instance running in VirtualBox? - hadoop

I'm using Hortonbox 3.0.1 in VirtualBox and SSH into it using PuTTY. I have some files on my local machine (Windows 10) which I want to store in the Hadoop file system.
SSH-ing into the Hortonbox instance gives me a terminal on the instance, which means none of the files from the Windows host are visible to that terminal.
Is there any way I can put files into the HDFS instance?
I am aware of WinSCP, but that does not really serve my purpose. WinSCP would mean sending the file onto the system, using my SSH session to store the file in Hadoop, and then deleting the file from the system after it is stored on the data nodes. I might be wrong, but this seems like additional, redundant work: I would always need buffer storage on the machine where Hadoop is running, and for extremely large files this approach will almost certainly fail, since I would first need to store the entire file on the secondary disk and only then send it to the data nodes through the name node. Is there any way to achieve this, or is the problem I'm facing due to using a HortonBox instance? How do organizations handle sending data from several nodes to the namenode and then to the datanodes?

First, you don't send data to the namenode for it to be placed on datanodes. When you issue hdfs put commands, the only information requested from the namenode is the set of datanode locations where the file's blocks should be placed.
That being said, if you want to skip SSH entirely, you need to forward the namenode and datanode ports from the VM to your host, then install and configure the hadoop fs/hdfs commands on your Windows host so that you can issue them directly from CMD.
The alternative is to use Fuse/SFTP/NFS/Samba mounts (referred to as a "shared folder" in the VirtualBox GUI) from Windows into the VM, where you could then run put without copying anything into the VM.
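The port-forwarding route might be sketched as follows. The VM name and the port numbers are assumptions (defaults vary between Hadoop and sandbox versions), so check your own cluster's configuration first:

```shell
# On the Windows host: forward the NameNode RPC and WebHDFS ports
# from the VM ("Hortonbox" is the name shown in the VirtualBox GUI).
VBoxManage modifyvm "Hortonbox" --natpf1 "nn-rpc,tcp,,8020,,8020"
VBoxManage modifyvm "Hortonbox" --natpf1 "webhdfs,tcp,,50070,,50070"

# With a Hadoop client installed on Windows and fs.defaultFS pointing
# at hdfs://localhost:8020, files can then be uploaded directly:
hdfs dfs -put C:\data\bigfile.csv /user/me/bigfile.csv
```

The file is then streamed from the host straight to the datanodes, with no intermediate copy inside the VM.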

Related

Can StreamSets be used to fetch data onto a local system?

Our team is exploring options for HDFS-to-local data fetches. StreamSets was suggested to us, but no one on the team knows anything about it. Could anyone help me understand whether it fits our requirement, which is to fetch data from HDFS onto our local system?
Just an additional question.
I have set up StreamSets locally, for example at local IP xxx.xx.x.xx:18630, and it works fine on one machine. But when I try to access this URL from another machine on the network, it doesn't work, while my other applications, like Shiny Server, work fine with the same mechanism.
Yes - you can read data from HDFS to a local filesystem using StreamSets Data Collector's Hadoop FS Standalone origin. As cricket_007 mentions in his answer, though, you should carefully consider if this is what you really want to do, as a single Hadoop file can easily be larger than your local disk!
Answering your second question, Data Collector listens on all addresses by default. There is a http.bindHost setting in the sdc.properties config file that you can use to restrict the addresses that Data Collector listens on, but it is commented out by default.
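A minimal sketch of the relevant lines in sdc.properties, assuming a default install (the specific address is a hypothetical example):

```shell
# In $SDC_CONF/sdc.properties -- commented out by default, so Data
# Collector binds to all interfaces:
#http.bindHost=0.0.0.0

# Uncommented with a specific address, it restricts where SDC listens:
http.bindHost=192.168.1.10
```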
You can use netstat to check - this is what I see on my Mac, with Data Collector listening on all addresses:
$ netstat -ant | grep 18630
tcp46 0 0 *.18630 *.* LISTEN
That wildcard, *, in front of the 18630 in the output means that Data Collector will accept connections on any address.
If you are running Data Collector directly on your machine, then the most likely problem is a firewall setting. If you are running Data Collector in a VM or on Docker, you will need to look at your VM/Docker network config.
I believe that by default StreamSets only exposes its services on localhost. You'll need to go through the config files to find where you can set it to listen on external addresses.
If you are using the CDH Quickstart VM, you'll need to externally forward that port.
Anyway, StreamSets is really designed to run as a cluster, on dedicated servers, for optimal performance. Its production deployments are comparable to Apache NiFi as offered in Hortonworks HDF.
So no, it wouldn't make sense to use the local FS destinations for anything other than testing/evaluation purposes.
If you want HDFS exposed as a local device, look into installing an NFS Gateway. Alternatively, you could probably use StreamSets to write to FTP / NFS.
It's not clear what data you're trying to get, but many BI tools can perform CSV exports, or Hue can be used to download files from HDFS. At the very least, hdfs dfs -getmerge is one minimalist way to get data from HDFS to local disk. However, Hadoop typically stores many TB of data in the ideal cases; if you're working with anything smaller, dumping those results into a database is usually a better option than moving around flat files.
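The getmerge route can be sketched as follows; the HDFS and local paths are hypothetical:

```shell
# Concatenate all part files of an HDFS output directory into one local file.
hdfs dfs -getmerge /user/me/job-output ./results.csv

# Roughly equivalent, streaming the parts through the client instead:
hdfs dfs -cat '/user/me/job-output/part-*' > results.csv
```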

Integrate local HDFS filesystem browser with IntelliJ IDEA

I studied the MapReduce paradigm using my university's HDFS cluster, accessing it through Hue. From Hue I am able to browse files, read/edit them, and so on.
So in that cluster I need:
a normal folder where I put the MapReduce.jar
the access to the results in the HDFS
I like writing MapReduce applications very much, so I have correctly configured a local HDFS as a personal playground, but for now I can access it only through the really time-wasting command line (such as those).
I can access my university's HDFS "directly" through IntelliJ IDEA by means of an SFTP remote host connection; the following is the "user normal folder":
And here is the HDFS from HUE from which I get the results:
Obviously, on my local machine the "normal user folder" is where I am in the shell, but I can browse HDFS to get results only via the command line.
I wish I could do such a thing even for the local HDFS. The following is the best I could do:
I know that it is possible to access HDFS via http://localhost:50070/explorer.html#/ but it is pretty terrible.
I looked for some plugins but did not find anything useful. Using the command line becomes tiring in the long run.
I can access my university's HDFS "directly" through IntelliJ IDEA by means of an SFTP remote host ...
Following is the best I could do...
Neither of those is HDFS.
The first is the user folder of the machine you SSH'd into.
The second is only the NameNode data directory on your local machine.
Hue uses WebHDFS and connects through http://namenode:50070.
What you would need is a plugin that can connect to the same API (which is not over SSH), or a simple file mount.
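As a rough sketch of talking to that same WebHDFS API yourself (host, port, and paths are placeholders for your cluster):

```shell
# List a directory through WebHDFS, the same REST API Hue uses:
curl -s "http://localhost:50070/webhdfs/v1/user/me?op=LISTSTATUS"

# Download a file; -L follows the NameNode's redirect to a DataNode:
curl -s -L "http://localhost:50070/webhdfs/v1/user/me/results.csv?op=OPEN" -o results.csv
```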
If you wanted a file mount, you need to setup an NFS Gateway, and you mount the NFS drive like any other network attached storage.
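With an NFS Gateway in place, the mount might look like this; the gateway host name is hypothetical, and the options are the ones the HDFS NFS Gateway documentation suggests:

```shell
sudo mkdir -p /mnt/hdfs
sudo mount -t nfs -o vers=3,proto=tcp,nolock nfs-gateway-host:/ /mnt/hdfs
ls /mnt/hdfs/user/me   # browse HDFS like ordinary network-attached storage
```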
In production environments, you would write your code, push it to GitHub, and then Jenkins (for example) would build the code and push it to HDFS for you.

how users should work with ambari cluster

My question is pretty trivial, but I didn't find anyone actually asking it.
We have an Ambari cluster with Spark, Storm, HBase, and HDFS (among other things).
I don't understand how a user who wants to use that cluster actually uses it.
For example, a user wants to copy a file to HDFS, run a spark-shell, or create a new table in the HBase shell.
Should he get a local account on the server that runs the corresponding service? Shouldn't he use a third-party machine (his own laptop, for example)?
If so, how should one use hadoop fs? There is no way to specify the server IP the way spark-shell allows.
What is the normal/right/expected way to run all these tasks from a user's perspective?
Thanks.
The expected way to run the described tasks from the command line is as follows.
First, gain access to the command line of a server that has the required clients installed for the services you want to use, e.g. HDFS, Spark, HBase et cetera.
During the process of provisioning a cluster via Ambari, it is possible to define one or more servers where the clients will be installed.
Here you can see an example of an Ambari provisioning process step. I decided to install the clients on all servers.
Afterwards, one way to figure out which servers have the required clients installed is to check your hosts views in Ambari. Here you can find an example of an Ambari hosts view: check the green rectangle to see the installed clients.
Once you have installed the clients on one or more servers, these servers will be able to utilize the services of your cluster via the command line.
Just to be clear, the utilization of a service by a client is location-independent from the server where the service is actually running.
Second, make sure that you are compliant with the security mechanisms of your cluster. In relation to HDFS, this could influence which users you are allowed to use and which directories you can access by using them. If you do not use security mechanisms like e.g. Kerberos, Ranger and so on, you should be able to directly run your stated tasks from the command line.
Third, execute your tasks via command line.
Here is a short example of how to access HDFS without considering security mechanisms:
ssh user@hostxyz # Connect to the server that has the required HDFS client installed
hdfs dfs -ls /tmp # Command to list the contents of the HDFS tmp directory
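Incidentally, on the remark that there is no way to point hadoop fs at a specific server: the client does accept a filesystem URI, either through the -fs generic option or as part of the path itself (the namenode host and port below are placeholders):

```shell
hadoop fs -fs hdfs://namenode.example.com:8020 -ls /tmp
hdfs dfs -ls hdfs://namenode.example.com:8020/tmp
```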
Take a look at Ambari Views, especially the Files view, which allows browsing HDFS.

Pull a file from remote location (local file system in some remote machine) into Hadoop HDFS

I have files on a machine (say A) which is not part of the Hadoop (or HDFS) datacenter, so machine A is at a remote location from the HDFS datacenter.
Is there a script, command, program, or tool that can run on machines which are connected to Hadoop (part of the datacenter) and pull the file from machine A into HDFS directly? If yes, what is the best and fastest way to do this?
I know there are many ways, like WebHDFS or Talend, but they need to run from machine A, and the requirement is to avoid that and run it on machines in the datacenter.
There are two ways to achieve this:
You can pull the data using scp and store it in a temporary location, then copy it to HDFS and delete the temporarily stored data.
If you do not want to keep it as a two-step process, you can write a program which reads the files from the remote machine and writes them to HDFS directly.
This question, along with its comments and answers, would come in handy for reading the file, while you can use the snippet below to write to HDFS.
// Write data to HDFS using the FileSystem API (classes from hadoop-common).
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

String outFile = "hdfs://localhost:<port>/foo/bar/baz.txt"; // path to the new file, including its name
FileSystem hdfs = FileSystem.get(new URI("hdfs://<NameNode Host>:<port>"), new Configuration());
Path newFilePath = new Path(outFile);
FSDataOutputStream out = hdfs.create(newFilePath);
// put a while loop here which reads until EOF and writes each chunk:
out.write(buffer, 0, bytesRead);
out.close();
Let buffer = 50 * 1024 if you have enough I/O capacity; depending on the processor, you could also use a much lower value like 10 * 1024.
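The first (two-step) option might be sketched like this; host names and paths are placeholders. The piped variant avoids the temporary copy entirely, since hdfs dfs -put - reads from stdin:

```shell
# Two-step: pull with scp, then copy into HDFS and clean up.
scp userA@machineA:/data/input.csv /tmp/input.csv
hdfs dfs -put /tmp/input.csv /user/me/input.csv
rm /tmp/input.csv

# One-step: stream straight from machine A into HDFS.
ssh userA@machineA cat /data/input.csv | hdfs dfs -put - /user/me/input.csv
```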
Please tell me if I am understanding your question the right way:
1 - You want to copy a file from a remote location.
2 - The client machine is not part of the Hadoop cluster.
3 - It may not contain the required libraries for Hadoop.
The best way is WebHDFS, i.e. the REST API.
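A minimal WebHDFS upload sketch (host, port, user, and paths are placeholders). The NameNode answers the first request with a 307 redirect to a DataNode; curl's -L re-sends the PUT, with the file as its body, to that address:

```shell
curl -s -L -X PUT -T /data/input.csv \
  "http://namenode:50070/webhdfs/v1/user/me/input.csv?op=CREATE&user.name=me"
```

This needs nothing on machine A beyond curl and HTTP access to the cluster.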

Hadoop cluster on NFS

I'm trying to set up a Hadoop cluster on 5 machines on the same LAN with NFS. The problem I'm facing is that the copy of Hadoop on one machine is replicated on all the machines, so I can't provide exclusive properties for each slave. Because of this, I get "Cannot create lock" kinds of errors. The FAQ suggests that NFS should not be used, but I have no other option.
Is there a way I can specify properties such that the master picks its conf files from location1, slave1 picks its conf files from location2, and so on?
Just to be clear, there's a difference between configurations for compute nodes and for HDFS storage. Your issue appears to be solely the storage location of the configurations. This can and should be done locally, or at least each machine should map to a symlink based on some locally identified configuration (e.g. mach01 -> /etc/config/mach01, ...).
(Revision 1) Regarding the comment/question below about symlinks: First, I'm going to admit that this is not something I can immediately solve. There are 2 approaches I see:
Have a script on the machine (e.g. at startup, or as a wrapper for starting Hadoop) determine the hostname (e.g. hostname -a), which then identifies a local symlink (e.g. /usr/local/hadoopConfig) to the correct directory in the NFS directory structure.
Set an environment variable, a la HADOOP_HOME, based on the local machine's hostname, and let various scripts work with this.
Although #1 should work, it is a method relayed to me, not one that I set up myself, and I'd be a little concerned about symlinks in the event that the hostname is misconfigured (this can happen). Method #2 seems more robust.
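Approach #1 might be sketched as a small wrapper like the one below. The directory layout is an assumption; for the sake of a runnable demo it uses paths under /tmp, where in practice NFS_CONF_ROOT would be the shared NFS path and LOCAL_LINK something like /usr/local/hadoopConfig:

```shell
#!/bin/sh
# Pick this machine's config directory on the shared mount by hostname,
# point a local symlink at it, and export it for the Hadoop scripts.
HOST="$(hostname -s)"
NFS_CONF_ROOT="/tmp/nfs-hadoop-conf"   # would be the NFS mount in practice
LOCAL_LINK="/tmp/hadoopConfig"         # would be /usr/local/hadoopConfig

mkdir -p "$NFS_CONF_ROOT/$HOST"        # per-host directory on the share
ln -sfn "$NFS_CONF_ROOT/$HOST" "$LOCAL_LINK"
export HADOOP_CONF_DIR="$LOCAL_LINK"
echo "HADOOP_CONF_DIR -> $(readlink "$LOCAL_LINK")"
```

Each machine then resolves the same local path to its own configuration directory, sidestepping the shared-copy lock errors.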
