Moving files to Hadoop HDFS using SFTP - hadoop

I have a VPC subnet with multiple machines inside it.
On one of the machines, I have some files stored. On another machine, I have the Hadoop HDFS service installed and running.
I need to move those files from the first machine to the HDFS file system using SFTP.
Does Hadoop have any APIs that can achieve this?
PS: I installed Hadoop using the Cloudera CDH4 distribution.

This requirement is much easier to implement on the FTP/SFTP server side than on the HDFS side.
Check out hdfs-over-ftp, an FTP server that works on top of HDFS.

A workflow written in Apache Oozie would do it. It comes with the Cloudera distribution. Other tools for orchestration could be Talend or PDI Kettle.

Related

Why can I use HBase without starting Hadoop/HDFS?

I am new to HBase. I recently installed it and tried to start it on my Mac. Everything is fine and I can play with HBase. Some articles say I should start Hadoop first when using HBase, so I am wondering whether this prerequisite has changed.
Hadoop is not a hard requirement for HBase unless you are running fully distributed, which you are not. When running on a single node, as you are, you can use the local filesystem. See HBase run modes: Standalone and Distributed for more information.
Your local filesystem (the file:// URI) is Hadoop-compatible. HBase requires a Hadoop-compatible storage layer, but that does not mean it must literally be HDFS.
HDFS simply adds scalability and reliability.

How to download Hadoop files (on HDFS) via FTP?

I would like to implement an SSIS job that can download large CSV files located on a remote Hadoop cluster. Of course, a regular FTP server on the Hadoop system does not expose HDFS files, since it serves the local filesystem.
I would like to know whether there is an FTP server implementation that sits on top of HDFS. I would prefer this approach to copying files from HDFS to the local FS and having the FTP server serve them from there, because that would require allocating more storage space.
I forked an open-source project that works as expected: https://github.com/jamesattard/maroodi

Is it possible to write to a remote HDFS?

As the title says, is it possible to write to a remote HDFS?
E.g. I have installed an HDFS cluster on AWS EC2, and I want to write a file from my local computer to the HDFS cluster.
There are two ways to write to a remote HDFS:
1. Use the WebHDFS API. It lets systems running outside the Hadoop cluster access and manipulate HDFS contents, and it doesn't require the client systems to have the Hadoop binaries installed.
2. Configure the client system as a Hadoop edge node to interact with the Hadoop cluster/HDFS.
Please refer to:
https://hadoop.apache.org/docs/r1.2.1/webhdfs.html
http://www.dummies.com/how-to/content/edge-nodes-in-hadoop-clusters.html
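To make the WebHDFS option more concrete, here is a minimal Python sketch of the two-step file creation described in the WebHDFS docs linked above: a PUT with op=CREATE to the NameNode, which answers with a 307 redirect, followed by a PUT of the file contents to the DataNode URL from the Location header. The host name, port 50070, the paths and the user name are placeholders I made up, and the user.name parameter assumes a cluster without Kerberos security. The helper names (webhdfs_url, upload) are hypothetical, not part of any Hadoop library.

```python
import http.client
import os
from urllib.parse import quote, urlencode, urlsplit


def webhdfs_url(namenode, hdfs_path, params):
    """Build a WebHDFS REST URL for an absolute HDFS path."""
    return f"http://{namenode}/webhdfs/v1{quote(hdfs_path)}?{urlencode(params)}"


def _put(url, body=None, headers=None):
    """Send one PUT without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=30)
    try:
        conn.request("PUT", f"{parts.path}?{parts.query}", body=body,
                     headers=headers or {})
        resp = conn.getresponse()
        resp.read()  # drain the response before closing
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def upload(namenode, hdfs_path, local_path, user):
    """Copy a local file to HDFS through WebHDFS (assumes no Kerberos)."""
    # Step 1: ask the NameNode where to write; no file data is sent yet.
    create = webhdfs_url(namenode, hdfs_path,
                         {"op": "CREATE", "user.name": user})
    status, location = _put(create)
    if status != 307 or not location:
        raise RuntimeError(f"expected a 307 redirect, got HTTP {status}")

    # Step 2: send the file contents to the DataNode URL we were given.
    size = str(os.path.getsize(local_path))
    with open(local_path, "rb") as f:
        status, _ = _put(location, body=f, headers={"Content-Length": size})
    if status != 201:
        raise RuntimeError(f"upload failed with HTTP {status}")


if __name__ == "__main__":
    # Placeholder values; replace with your own NameNode, paths and user.
    upload("namenode.example.com:50070", "/user/alice/data.csv",
           "data.csv", "alice")
```

Note that the redirect points at a DataNode host name, so the client machine has to be able to resolve and reach the DataNodes, not only the NameNode. On EC2 that usually means checking security groups and DNS.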

Does Mahout need to be installed on Hadoop's master node?

That's a dumb question, but somebody has to ask it.
I've tried running Mahout locally, which worked. Now I want the work to be performed by a remote cluster, not my local machine.
So, should I deploy the Mahout code on the Hadoop machines, or can I still make Mahout on my local machine talk to Hadoop remotely?
No, you don't install Hadoop programs on the Hadoop workers yourself. That would be a nightmare to maintain. Hadoop does it for you when you provide it the JAR file with all code via hadoop jar.
What runs on your local machine, when you run Mahout or anything else Hadoop-based, is a client program that uses Hadoop code to send info to a cluster to start work. That cluster might be local, or remote -- makes no difference to how you run the client, just what the client talks to.

Access HDFS from outside Hadoop

I want to run some executables outside of Hadoop (but on the same cluster) using input files that are stored inside HDFS.
Do these files need to be copied locally to the node? Or is there a way to access HDFS from outside Hadoop?
Any other suggestions on how to do this are fine. Unfortunately, my executables cannot be run within Hadoop.
Thanks!
There are a couple of typical ways:
You can access HDFS files through the HDFS Java API if you are writing your program in Java. You are probably looking for open. This will give you a stream that acts like a generic open file.
You can stream your data with hadoop fs -cat if your program takes input through stdin: hadoop fs -cat /path/to/file/part-r-* | myprogram.pl. You could hypothetically build a bridge around this command with something like popen.
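The popen bridge from the second option could look roughly like this in Python. It is only a sketch: it assumes the hadoop command is on the PATH of the machine running it, and myprogram.pl and the HDFS path are the placeholders from the command above. The function names are made up for illustration.

```python
import subprocess


def build_cat_command(hdfs_glob, hadoop_bin="hadoop"):
    """Return the argv for 'hadoop fs -cat <glob>'.

    No shell is involved, so the glob is passed through unchanged and
    expanded by Hadoop itself, not by the local shell.
    """
    return [hadoop_bin, "fs", "-cat", hdfs_glob]


def pipe_hdfs_to_program(hdfs_glob, program_argv, hadoop_bin="hadoop"):
    """Stream HDFS file contents into the stdin of an external program."""
    cat = subprocess.Popen(build_cat_command(hdfs_glob, hadoop_bin),
                           stdout=subprocess.PIPE)
    consumer = subprocess.Popen(program_argv, stdin=cat.stdout)
    # Close our copy of the pipe so 'hadoop fs -cat' sees a broken pipe
    # if the consumer exits early.
    cat.stdout.close()
    consumer_rc = consumer.wait()
    cat_rc = cat.wait()
    return cat_rc, consumer_rc


if __name__ == "__main__":
    # Placeholder path and program taken from the shell example.
    codes = pipe_hdfs_to_program("/path/to/file/part-r-*", ["./myprogram.pl"])
    print("exit codes (hadoop, program):", codes)
```

The data is never written to local disk, so this works for files larger than the free space on the node, at the cost of starting a JVM for every hadoop fs call.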
Also check out WebHDFS, which made it into the 1.0.0 release and will also be in the 23.1 release. Since it's based on a REST API, any language can access it, and Hadoop need not be installed on the node where the HDFS files are needed. Also, it's as fast as the other options mentioned by orangeoctopus.
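For reading, a WebHDFS client needs nothing beyond an HTTP library. A minimal Python sketch using only the standard library is below. It assumes WebHDFS is enabled on the cluster and that security is off, so the user.name query parameter is enough. The host name, port 50070, path and user are placeholders, and the function names are hypothetical.

```python
import shutil
import sys
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def open_url(namenode, hdfs_path, user):
    """Build the WebHDFS URL that reads a file (op=OPEN)."""
    query = urlencode({"op": "OPEN", "user.name": user})
    return f"http://{namenode}/webhdfs/v1{quote(hdfs_path)}?{query}"


def copy_hdfs_file(namenode, hdfs_path, user, out):
    """Stream one HDFS file into a binary file-like object."""
    # The NameNode replies with a redirect to a DataNode; urlopen follows
    # redirects for GET requests, so one call is enough here.
    with urlopen(open_url(namenode, hdfs_path, user)) as resp:
        shutil.copyfileobj(resp, out)


if __name__ == "__main__":
    # Placeholder values; writes the file contents to stdout.
    copy_hdfs_file("namenode.example.com:50070", "/path/to/file/part-r-00000",
                   "alice", sys.stdout.buffer)
```

Because the redirect points at a DataNode, the node running this has to be able to reach the DataNodes by the host names the cluster reports.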
The best way is to install the "hadoop-0.20-native" package on the box where you are running your code.
The hadoop-0.20-native package can access the HDFS filesystem. It can act as an HDFS proxy.
I had a similar issue and asked a question about it. I needed to access HDFS / MapReduce services from outside the cluster. After I found a solution, I posted an answer here for HDFS. The most painful issue turned out to be user authentication, which I solved for the simplest case (complete code is in my question).
If you need to minimize dependencies and don't want to install Hadoop on clients, here is a nice Cloudera article on how to configure Maven to build a JAR for this. It worked 100% in my case.
The main difference between remote MapReduce job submission and HDFS access is a single configuration setting (check the mapred.job.tracker variable).

Resources