How to download Hadoop files (on HDFS) via FTP?

I would like to implement an SSIS job that is able to download large CSV files that are located on a remote Hadoop cluster. Of course, just running a regular FTP server on the Hadoop system does not expose the HDFS files, since it serves the local filesystem.
I would like to know whether there is an FTP server implementation that sits on top of HDFS. I would prefer this approach over copying files from HDFS to the local FS and then serving them with a regular FTP server, because that would require allocating extra storage space.

I forked an open-source project that works as expected: https://github.com/jamesattard/maroodi

Related

Integrate local HDFS filesystem browser with IntelliJ IDEA

I studied the MapReduce paradigm using my university's HDFS cluster, accessing it through Hue. From Hue I am able to browse files, read and edit them, and so on.
So in that cluster I need:
a normal folder where I put the MapReduce.jar
access to the results in HDFS
I very much like writing MapReduce applications, so I have correctly configured a local HDFS as a personal playground, but for now I can only access it through the command line, which is really time-consuming.
I can access the university HDFS "directly" through IntelliJ IDEA by means of an SFTP remote host connection; the following is the "normal user folder":
And here is the HDFS in Hue, from which I get the results:
Obviously, on my local machine the "normal user folder" is wherever I am with the shell, but I can browse HDFS to get results only through the command line.
I wish I could do the same thing for my local HDFS. The following is the best I could do:
I know that it is possible to access HDFS through http://localhost:50070/explorer.html#/, but it is quite clunky.
I looked for some plugins, but I did not find anything useful. Using the command line in the long run becomes tiring.
I can access the university HDFS "directly" through IntelliJ IDEA by means of an SFTP remote host ...
The following is the best I could do...
Neither of those is HDFS.
The first is the user folder of the machine you SSH'd into.
The second is only the NameNode data directory on your local machine.
Hue uses WebHDFS, and connects through http://namenode:50070
What you would need is a plugin that can connect to the same API (which is not over SSH), or a simple file mount.
If you want a file mount, you need to set up an NFS Gateway, then mount the NFS share like any other network-attached storage.
In Production environments, you would write your code, push it to Github, then Jenkins (for example) would build the code and push it to HDFS for you.
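For illustration, any small tool or plugin could talk to that same WebHDFS endpoint through the Hadoop client's webhdfs:// scheme. A minimal sketch that lists a directory (the namenode host, port, and path are placeholders, and Kerberos/authentication handling is omitted):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListOverWebHdfs {
    public static void main(String[] args) throws Exception {
        // 50070 is the default WebHDFS/HTTP port in Hadoop 2.x; host and path are hypothetical
        FileSystem fs = FileSystem.get(URI.create("webhdfs://namenode:50070"), new Configuration());
        // List a directory over HTTP, the same API Hue talks to
        for (FileStatus status : fs.listStatus(new Path("/user/me"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}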

Loading Batch Offline Data to DWH environment with Kafka as the "Entering door"

Some context to my question.
As you can see here:
https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c
There are two "doors" for loading data into HDFS:
Sqoop
Kafka
Using this topology as an example, what would be the best practice for loading batch offline data hosted on an FTP server into HDFS?
Let's also assume that no transformations need to be performed on the file; we need to store it in HDFS with the same structure it has on the FTP server.
Thoughts?
Kafka isn't exactly configured to transfer file-sized data by default; at least, not entire files in one message. You could break the file into lines, but then you would need to reorder them and put them back together in HDFS.
In my experience, I've seen a few options for ingesting from an FTP server.
Vanilla Hadoop, no extra software required
Use an NFS Gateway, WebHDFS or HttpFS to copy files directly to HDFS as if it were another filesystem
Additional software required
Your own code with an FTP client and an HDFS client connection (a minimal sketch follows after this answer)
Spark Streaming w/ an FTP Connector and HDFS write output
Kafka & Kafka Connect with an FTP Connector source and HDFS Sink
A Flume agent running on the FTP Server with an HDFS sink
Apache NiFi with a GetFTP and PutHDFS processor
StreamSets Data Collector doing something similar to NiFi (I don't know the exact processor names for this one)
we need to store it in HDFS in the same structure it is stored in the FTP server.
If these are small files, you're better off at least compressing them into a Hadoop-supported archive format before uploading to HDFS.
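As a rough illustration of the "your own code" option above, here is a minimal sketch that streams a single file from an FTP server straight into HDFS using Apache Commons Net and the Hadoop FileSystem API. The host names, credentials, and paths are invented placeholders, and error handling is kept to a minimum.

import java.io.InputStream;
import java.net.URI;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FtpToHdfs {
    public static void main(String[] args) throws Exception {
        // Hypothetical FTP server and credentials
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");
        ftp.login("user", "password");
        ftp.enterLocalPassiveMode();
        ftp.setFileType(FTP.BINARY_FILE_TYPE);

        // Hypothetical namenode address and target path
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        try (InputStream in = ftp.retrieveFileStream("/outgoing/data.csv");
             FSDataOutputStream out = hdfs.create(new Path("/landing/data.csv"))) {
            IOUtils.copyBytes(in, out, 64 * 1024, false); // copy in 64 KB chunks; streams are closed by try-with-resources
        }
        ftp.completePendingCommand(); // finalize the FTP transfer after the stream is closed
        ftp.logout();
        ftp.disconnect();
        hdfs.close();
    }
}

Note that Hadoop itself also ships an ftp:// filesystem implementation (quoted in an answer further below), so for simple one-off copies running hadoop distcp from an ftp:// URI to an hdfs:// URI may be enough.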

Apache Spark Streaming from folder (not HDFS)

I was wondering if there is any reliable way to create Spark streams from a local directory (not HDFS)? I was using textFileStream, but it seems it is mainly meant for files in HDFS. If you look at the definition of the function, it says "Create an input stream that monitors a Hadoop-compatible filesystem".
Are you implying that HDFS is not a physical location? There are datanode directories that physically exist...
You should be able to use textFile with the file:// URI, but you need to ensure all nodes in the cluster can read from that location.
From the definition of a Hadoop-compatible filesystem:
The selection of which filesystem to use comes from the URI scheme used to refer to it: the prefix hdfs: on any file path means that it refers to an HDFS filesystem; file: to the local filesystem, s3: to Amazon S3, ftp: to FTP, swift: to OpenStack Swift, and so on.
There are other filesystems that provide explicit integration with Hadoop through the relevant Java JAR files, native binaries, and configuration parameters needed to add a new schema to Hadoop.
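So textFileStream works against any of those schemes, including the plain local filesystem. A minimal Java sketch (the directory path and batch interval are made up for illustration); keep in mind that every worker must be able to see the same path, and only files newly moved into the directory after the stream starts are picked up.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class LocalFolderStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("LocalFolderStream").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // file:// scheme -> monitor a plain local directory instead of HDFS
        JavaDStream<String> lines = jssc.textFileStream("file:///data/incoming");
        lines.print();

        jssc.start();
        jssc.awaitTermination();
    }
}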

Pull a file from remote location (local file system in some remote machine) into Hadoop HDFS

I have files on a machine (say A) that is not part of the Hadoop (or HDFS) datacenter, so machine A is at a remote location from the HDFS datacenter.
Is there a script, command, program, or tool that can run on the machines connected to Hadoop (part of the datacenter) and pull the file from machine A into HDFS directly? If yes, what is the best and fastest way to do this?
I know there are many ways, like WebHDFS or Talend, but they need to run from machine A, and the requirement is to avoid that and run it on the machines in the datacenter.
There are two ways to achieve this:
You can pull the data using scp and store it in a temporary location, then copy it to HDFS and delete the temporarily stored data.
If you do not want to keep it as a two-step process, you can write a program that reads the files from the remote machine and writes them to HDFS directly.
This question, along with its comments and answers, would come in handy for reading the file, while you can use the snippet below to write to HDFS.
String outFile = "hdfs://localhost:<port>/foo/bar/baz.txt"; // path to the new file, including its name
FileSystem hdfs = FileSystem.get(new URI("hdfs://<NameNode Host>:<port>"), new Configuration());
Path newFilePath = new Path(outFile);
FSDataOutputStream out = hdfs.create(newFilePath);
// put a while loop here that reads from the source into buffer until EOF
// and writes each chunk with the statement below
out.write(buffer, 0, bytesRead); // bytesRead = number of bytes returned by the read call
out.close();
Let the buffer be around 50 * 1024 bytes if you have enough I/O capacity; otherwise you could use a much lower value, like 10 * 1024.
Please tell me if I am understanding your question the right way:
1 - You want to copy a file from a remote location.
2 - The client machine is not part of the Hadoop cluster.
3 - It may not contain the required Hadoop libraries.
The best way is WebHDFS, i.e. the REST API.
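For illustration, the two-step WebHDFS CREATE call can be driven with nothing more than a plain HTTP client, so the machine pushing the data does not need any Hadoop libraries at all. A rough sketch with java.net (the host, port, path, and user name are placeholders):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WebHdfsUpload {
    public static void main(String[] args) throws Exception {
        // Step 1: ask the NameNode where to write; it answers with a 307 redirect to a DataNode
        URL nameNode = new URL("http://namenode:50070/webhdfs/v1/data/file.csv"
                + "?op=CREATE&overwrite=true&user.name=hdfs"); // hypothetical host, path, and user
        HttpURLConnection first = (HttpURLConnection) nameNode.openConnection();
        first.setRequestMethod("PUT");
        first.setInstanceFollowRedirects(false); // read the Location header ourselves
        String dataNodeUrl = first.getHeaderField("Location");
        first.disconnect();

        // Step 2: PUT the file body to the DataNode URL returned above
        HttpURLConnection second = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
        second.setRequestMethod("PUT");
        second.setDoOutput(true);
        try (OutputStream out = second.getOutputStream();
             InputStream in = Files.newInputStream(Paths.get("/local/path/file.csv"))) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        System.out.println("HTTP " + second.getResponseCode()); // expect 201 Created
    }
}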

Moving files to Hadoop HDFS using SFTP

I have a VPC subnet which has multiple machines inside it.
On one of the machines, I have some files stored. On another machine, I have the Hadoop HDFS service installed and running.
I need to move those files from the first machine to the HDFS filesystem using SFTP.
Does Hadoop have some APIs that can achieve this goal?
PS: I've installed Hadoop using the Cloudera CDH4 distribution.
This is a requirement which is much easier to implement on the FTP/SFTP server side than on the HDFS side.
Check out hdfs-over-ftp, an FTP server that works on top of HDFS.
A workflow written in Apache Oozie would do it. It comes with the Cloudera distribution. Other tools for orchestration could be Talend or PDI Kettle.
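There is no single built-in "SFTP to HDFS" call in CDH4, but combining an SFTP client library with the Hadoop FileSystem API is straightforward. A minimal sketch using JSch (host names, credentials, and paths are invented for illustration):

import java.io.InputStream;
import java.net.URI;
import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SftpToHdfs {
    public static void main(String[] args) throws Exception {
        // Hypothetical SFTP source machine inside the VPC
        JSch jsch = new JSch();
        Session session = jsch.getSession("user", "files-host", 22);
        session.setPassword("password");
        session.setConfig("StrictHostKeyChecking", "no"); // acceptable inside a private subnet
        session.connect();
        ChannelSftp sftp = (ChannelSftp) session.openChannel("sftp");
        sftp.connect();

        // Hypothetical namenode address and target path
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        try (InputStream in = sftp.get("/data/export/file.csv");
             FSDataOutputStream out = hdfs.create(new Path("/landing/file.csv"))) {
            IOUtils.copyBytes(in, out, 64 * 1024, false); // stream the file straight into HDFS
        }

        sftp.disconnect();
        session.disconnect();
        hdfs.close();
    }
}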
