SFTP file system in Hadoop

Do Hadoop version 2.0.0 and CDH4 have an SFTP file system in place? I know Hadoop has support for an FTP FileSystem. Does it have something similar for SFTP? I have seen some patches submitted for the same, though I couldn't make sense of them.

Consider using hadoop distcp.
Check here. That would be something like:
hadoop distcp \
  -D fs.sftp.credfile=/user/john/credstore/private/mycreds.prop \
  sftp://myHost.ibm.com/home/biadmin/myFile/part1 \
  hdfs:///user/john/myfiles

After some research, I have figured out that Hadoop currently doesn't have a FileSystem implementation for SFTP. Hence, if you wish to read data over an SFTP channel, you have to either write an SFTP FileSystem (which is quite a big deal, extending and overriding lots of classes and methods), patches for which have already been developed, though not yet integrated into Hadoop, or get a customized InputFormat that reads from streams, which again is not implemented in Hadoop.
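If you just need to pull data over an SFTP channel without a full FileSystem implementation, here is a minimal sketch of opening a remote file as a stream with the JSch library; the host, credentials and paths are hypothetical placeholders, and you would still need to wrap this in your own InputFormat or a copy step into HDFS:
import java.io.InputStream;
import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class SftpStreamExample {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession("user", "sftp.example.com", 22);
        session.setPassword("password");
        session.setConfig("StrictHostKeyChecking", "no"); // quick test only
        session.connect();
        ChannelSftp channel = (ChannelSftp) session.openChannel("sftp");
        channel.connect();
        InputStream in = channel.get("/remote/path/to/file"); // stream of the remote file
        // ... consume the stream here, e.g. copy it into HDFS ...
        in.close();
        channel.disconnect();
        session.disconnect();
    }
}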

You need to ensure core-site.xml has the property fs.sftp.impl set to the value org.apache.hadoop.fs.sftp.SFTPFileSystem.
After this, Hadoop commands will work. A couple of samples are given below.
ls command
Command on Hadoop:
hadoop fs -ls /
Equivalent for SFTP:
hadoop fs -D fs.sftp.user.{hostname}={username} -D fs.sftp.password.{hostname}.{username}={password} -ls sftp://{hostname}:22/
distcp command
Command on Hadoop:
hadoop distcp {sourceLocation} {destinationLocation}
Equivalent for SFTP:
hadoop distcp -D fs.sftp.user.{hostname}={username} -D fs.sftp.password.{hostname}.{username}={password} sftp://{hostname}:22/{sourceLocation} {destinationLocation}
Ensure you replace all the placeholders when trying these commands. I tried them on AWS EMR 5.28.1, which has Hadoop 2.8.5 installed.

So, hopefully cleaning up these answers a bit into something more digestible: basically, Hadoop/HDFS is capable of supporting SFTP, it's just not enabled by default, nor is it really documented very well in core-default.xml.
The key configuration you need to set to enable SFTP support is:
<property>
  <name>fs.sftp.impl</name>
  <value>org.apache.hadoop.fs.sftp.SFTPFileSystem</value>
</property>
Alternatively, you can set it right at the CLI, depending on your command:
hdfs dfs \
  -Dfs.sftp.impl=org.apache.hadoop.fs.sftp.SFTPFileSystem \
  -Dfs.sftp.keyfile=~/.ssh/java_sftp_testkey.ppk \
  -ls sftp://$USER@localhost/tmp/
The biggest requirement is that your SSH keyfile needs to be password-less for this to work. That can be done via:
cp ~/.ssh/mykeyfile.ppk ~/.ssh/mykeyfile.ppk.orig      # keep the original, passphrase-protected key
ssh-keygen -p -P MyPass -N "" -f ~/.ssh/mykeyfile.ppk  # strip the passphrase in place
mv ~/.ssh/mykeyfile.ppk ~/.ssh/mykeyfile_nopass.ppk    # this passphrase-less copy is the one to point fs.sftp.keyfile at
mv ~/.ssh/mykeyfile.ppk.orig ~/.ssh/mykeyfile.ppk      # restore the original key
And finally, the biggest (and maybe neatest) use is via distcp, if you need to send/receive a large amount of data to/from an SFTP server. There's an oddity in that the SSH keyfile is needed locally to generate the directory listing, as well as on the cluster nodes for the actual workers.
Something like this should work well enough:
cd workdir
ln -s ~/.ssh/java_sftp_testkey.ppk
hadoop distcp \
  --files ~/.ssh/java_sftp_testkey.ppk \
  -Dfs.sftp.impl=org.apache.hadoop.fs.sftp.SFTPFileSystem \
  -Dfs.sftp.keyfile=java_sftp_testkey.ppk \
  hdfs:///path/to/source/ \
  sftp://user@FQDN/path/to/dest

Related

Hadoop configure cluster queried based on a flag/env parameter

Apologies beforehand if this turns out to be a silly question, I am new to hadoop environment.
I have two hadoop clusters my-prod-cluster and my-bcp-cluster.
Both are accessible over the same network.
Is there any way to configure my clusters in such a way that, when I am in BCP mode, all my queries to my-prod-cluster get routed to my-bcp-cluster (on the basis of some config parameter or environment variable)?
So when flag=prod
hadoop fs -ls /my-prod-cluster/mydir translates to hadoop fs -ls /my-prod-cluster/mydir
and fetches the data in /my-prod-cluster/mydir
when the flag=bcp
hadoop fs -ls /my-prod-cluster/mydir translates to hadoop fs -ls /my-bcp-cluster/mydir
and fetches data from /my-bcp-cluster/mydir
I am using the MapR flavour of Hadoop (provided by HP), version 6.1, in case that matters.
You could easily make a shell wrapper script that prepends the NameNode address to each query.
For example, a fully-qualified command would look like this:
hdfs dfs -ls hdfs://my-prod-cluster.domain.com/path/to/mydir
So, refactoring that, you could have a script like
#!/bin/sh
if [ "$1" = "prod" ]; then
  NAMENODE=hdfs://my-prod-cluster.domain.com
fi
# TODO: error handling and more clusters
TARGET_PATH=$2   # don't reuse PATH here, or the hdfs binary won't be found
hdfs dfs -ls "${NAMENODE}${TARGET_PATH}"
Then execute something like my-hdfs-ls prod /mydir
If you need something more complex than that, like Kerberos tickets and such, then creating a separate HADOOP_CONF_DIR variable with unique core-site and hdfs-site XMLs for each cluster would be recommended.
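For instance, a rough sketch of that approach (the conf directory paths are hypothetical):
# one directory per cluster, each holding its own core-site.xml / hdfs-site.xml
export HADOOP_CONF_DIR=/etc/hadoop/conf.bcp   # or /etc/hadoop/conf.prod
hadoop fs -ls /mydir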

Move zip files from one server to hdfs?

What is the best approach to move files from one Linux box to HDFS? Should I use Flume or SSH?
SSH command:
cat kali.txt | ssh user@hadoopdatanode.com "hdfs dfs -put - /data/kali.txt"
The only problem with SSH is that I need to enter the password every time; I need to check how to pass the password without interactive authentication.
Can Flume move files straight to HDFS from one server?
Maybe you can set up passwordless SSH, then transfer files without entering a password.
Maybe you can create a script, in Python for example, which does the job for you.
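For the passwordless SSH suggestion, a minimal sketch (reusing the user, host and file from the question):
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa      # generate a key pair with an empty passphrase
ssh-copy-id user@hadoopdatanode.com           # install the public key on the remote host
cat kali.txt | ssh user@hadoopdatanode.com "hdfs dfs -put - /data/kali.txt"   # no password prompt now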
You could install the Hadoop client on the Linux box that has the files. Then you could "hdfs dfs -put" your data directly from that box to the Hadoop cluster.
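With the client configured against the cluster, that would look something like this (the local path is a hypothetical placeholder):
hdfs dfs -put /local/path/kali.zip /data/kali.zip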

Failed to copy file from FTP to HDFS

I have an FTP server (F), a Linux box (S, standalone) and a Hadoop cluster (C). The current file flow is F -> S -> C. I am trying to improve performance by skipping S.
The current flow is:
wget ftp://user:password@ftpserver/absolute_path_to_file
hadoop fs -copyFromLocal path_to_file path_in_hdfs
I tried:
hadoop fs -cp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
and:
hadoop distcp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
Both hang. The distcp one, being a job, is killed by a timeout. The logs (hadoop job -logs) only say it was killed by the timeout. I tried to wget from the FTP server on some node of C and it worked. What could be the reason, and any hint on how to figure it out?
Pipe it through stdin:
wget -O - ftp://user:password@ftpserver/absolute_path_to_file | hadoop fs -put - path_in_hdfs
The -O - tells wget to write the download to stdout, and the single - tells HDFS put to read from stdin.
hadoop fs -cp ftp://user:password@ftpserver.com/absolute_path_to_file path_in_hdfs
This cannot be used, as it expects the source file to be in the local file system; it does not take into account the scheme you are trying to pass. Refer to the javadoc: FileSystem.
distcp is only for large intra- or inter-cluster copying (to be read as between Hadoop clusters, i.e. HDFS). Again, it cannot get data from FTP. The two-step process is still your best bet. Or write a program to read from FTP and write to HDFS, as sketched below.
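If you go the program route, here is a rough sketch using Apache Commons Net for the FTP side and the Hadoop FileSystem API for the HDFS side; the host, credentials and paths are hypothetical placeholders:
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FtpToHdfs {
    public static void main(String[] args) throws Exception {
        FTPClient ftp = new FTPClient();
        ftp.connect("ftpserver");
        ftp.login("user", "password");
        ftp.enterLocalPassiveMode();
        ftp.setFileType(FTP.BINARY_FILE_TYPE);
        InputStream in = ftp.retrieveFileStream("/absolute_path_to_file");

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);                 // picks up fs.defaultFS from the cluster config
        OutputStream out = fs.create(new Path("/path_in_hdfs"));

        IOUtils.copyBytes(in, out, 4096, true);               // stream the file and close both ends
        ftp.completePendingCommand();
        ftp.logout();
        ftp.disconnect();
    }
}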

How does one run "hadoop fs -text ." remotely using the Java API?

Basically, what I want is to use the Hadoop Java API to call a remote Hadoop cluster from my local machine and have the Hadoop cluster execute the command.
It should be roughly equivalent to "ssh user@remote 'hadoop fs -text .'"
First of all, if all you want is exactly what hadoop fs -text gives you, then you can certainly just install the hadoop client on your local machine, and run it there, being sure to specify the full path:
hadoop fs -text hdfs://remote.namenode.host:9000/my/file
But if you do have a reason to do it from Java, the basic answer is something like this:
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
Configuration conf = new Configuration();
Path p = new Path("hdfs://remote.namenode.host:9000/foo/bar");
FileSystem fs = p.getFileSystem(conf);
InputStream in = fs.open(p);
You can then read from that input stream however you like: copy it to stdout or whatever.
Note that fs -text is a little bit more clever than just raw copying: it detects gzipped files and sequence files and "decodes" them into text. This is pretty tricky; you can check out the source code to see how it's done internally.
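As a rough sketch of that decoding step for plain and gzipped files (sequence files would need SequenceFile.Reader on top of this), reusing the hypothetical NameNode address from above:
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class RemoteTextCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path p = new Path("hdfs://remote.namenode.host:9000/foo/bar.gz");
        FileSystem fs = p.getFileSystem(conf);
        InputStream in = fs.open(p);
        // Pick a decompressor based on the file extension, if any.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(p);
        if (codec != null) {
            in = codec.createInputStream(in);
        }
        IOUtils.copyBytes(in, System.out, 4096, true);   // copy to stdout and close the stream
    }
}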

Is there any way for a fully distributed Hadoop/MapReduce program to have its individual nodes be reading local input files?

I am trying to set up a fully-distributed Hadoop/MapReduce instance where each node will run a series of C++ Hadoop Streaming tasks on some input. However, I don't want to move all the input onto HDFS; instead I want to see if there is a way to read input data from the local folders of each node.
Is there any way to do this?
EDIT:
An example of a hadoop command I would like to run is something similar to:
hadoop jar $HADOOP_STREAM/hadoop-streaming-0.20.203.0.jar \
-mapper map_example \
-input file:///data/ \
-output /output/ \
-reducer reducer_example \
-file map_example \
-file reducer_example
In this case, the data stored in each of my nodes is in the /data/ directory, and I want the output to go to the /output/ directory of each individual node. The map_example and reducer_example files are locally available in all nodes.
How would I be able to implement a Hadoop command which, if run on the master node, makes all the slave nodes essentially run the same task across x number of nodes, resulting in a local output file on each node (based on the local input files)?
Thanks
As noted by this question, this appears possible. Though I have not tested this, it appears that you can set fs.default.name in conf/core-site.xml to refer to a file URL instead of an HDFS URL.
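Though I have not tested it either, a minimal core-site.xml sketch of that idea would look like this (note that fs.default.name is deprecated in favour of fs.defaultFS in newer Hadoop versions):
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
</property>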
Some refs:
http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/27100
http://librarian.phys.washington.edu/athena/index.php/Running_Hadoop_on_Athena (this refers to an older version of Hadoop).
This is not exactly a Hadoop solution, but you could write a program (say, in Python) that forks multiple processes which ssh into each of the slave machines and run the map-reduce code.
hadoop dfsadmin -report
allows you to list the IPs in the cluster.
You can make each process ssh into each of the IPs and run the mapper and reducer.
Map reduce in *nix can be implemented using pipes.
cat <input> | c++ mapper | sort | c++ reducer > <output_location>
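Putting those two ideas together, a rough sketch (this assumes the "Name:" lines of the dfsadmin report carry the node IPs, and that the mapper, reducer and the /data and /output paths from the question exist on every node):
for ip in $(hdfs dfsadmin -report | awk '/Name:/ {split($2, a, ":"); print a[1]}'); do
  ssh "$ip" 'cat /data/* | ./map_example | sort | ./reducer_example > /output/part-local' &
done
wait   # wait for every node to finish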
