Loading Batch Offline Data to DWH environment with Kafka as the "Entering door" - hadoop

Some context to my question.
As you can see here:
https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c
There are 2 "doors" to load data into HDFS
Sqoop
Kafka
Using this topology as an example, what would be the best practice for loading batch offline data, hosted on an FTP server, into HDFS?
Let's also assume that no changes need to be performed on the file; we need to store it in HDFS in the same structure it has on the FTP server.
Thoughts?

Kafka isn't really configured to transfer "file sized" data by default; at least, not entire files in one message. You could break the file into line-sized messages, but then you would need to reorder them and reassemble the file in HDFS.
In my experience, I've seen a few options from an FTP server.
Vanilla Hadoop, no extra software required
Use an NFS Gateway, WebHDFS or HttpFS to copy files directly to HDFS as if it were another filesystem
Additional Software required
Your own code with an FTP and HDFS client connection (a minimal shell sketch of this idea follows after this list)
Spark Streaming w/ an FTP Connector and HDFS write output
Kafka & Kafka Connect with an FTP Connector source and HDFS Sink
A Flume agent running on the FTP Server with an HDFS sink
Apache NiFi with a GetFTP and PutHDFS processor
Streamsets Data Collector doing something similar to NiFi (don't know the terms for this one)
Regarding "we need to store it in HDFS in the same structure it is stored in the FTP server": if these are small files, you're better off at least compressing them into a Hadoop-supported archive format before uploading to HDFS.
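As a minimal sketch of the "own FTP and HDFS client connection" idea above, you can pipe curl straight into the FS shell so the file never touches a local disk. The host, credentials and paths below are placeholders, and it assumes a configured Hadoop client on the machine running the command:
# stream a file from the FTP server directly into HDFS (no intermediate local copy)
curl -s "ftp://<user>:<password>@<ftp-host>/<path>/<file>" | hadoop fs -put - /<hdfs-path>/<file>
hadoop fs -ls /<hdfs-path>/<file>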

Related

nifi putHDFS writes to local filesystem

Challenge
I currently have two Hortonworks clusters, a NiFi cluster and an HDFS cluster, and want to write to HDFS using NiFi.
On the NiFi cluster I use a simple GetFile connected to a PutHDFS.
When pushing a file through this, the PutHDFS terminates in success. However, rather than seeing a file dropped on my HDFS (on the HDFS cluster), I just see a file being dropped onto the local filesystem where I run NiFi.
This confuses me, hence my question:
How to make sure PutHDFS writes to HDFS, rather than to the local filesystem?
Possibly relevant context:
In the PutHDFS I have linked to the hive-site and core-site of the HDFS cluster (I tried updating all server references to the HDFS namenode, but with no effect)
I don't use Kerberos on the HDFS cluster (I do use it on the NIFI cluster)
I did not see anything looking like an error in the NiFi app log (which makes sense, as it successfully writes, just in the wrong place)
Both clusters are newly generated on Amazon AWS with CloudBreak, and opening all nodes to all traffic did not help
Can you make sure that you are able to move a file from the NiFi node to Hadoop using the command below:
hadoop fs -put <local_file> <hdfs_path>
If you are able to move your file using the above command, then check the Hadoop config files you are passing to your PutHDFS processor.
Also, check that you don't have any other flow running that might be processing that file.
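To rule out connectivity issues, a quick check is to target the HDFS cluster's NameNode explicitly from the NiFi node; the hostname and port below are placeholders for whatever fs.defaultFS of the HDFS cluster points at:
hadoop fs -put test.txt hdfs://<namenode-host>:8020/tmp/
hadoop fs -ls hdfs://<namenode-host>:8020/tmp/test.txt
If the file lands there but PutHDFS still writes locally, the core-site.xml given to PutHDFS is most likely pointing at the local filesystem (file:///) rather than at the remote NameNode.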

Import live data from server to hdfs instantly?

Is it possible to load a live feed from a server into HDFS? I have to load the live feed data, which is coming via the server, into HDFS instantly, without any loss of time.
There are lots of technologies that consume data in real time (or near real time) and have write connectors to HDFS.
Flume
Nifi
Streamsets
are ones I have used.
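For a very crude illustration of the idea without any of those tools, you can append a growing log to HDFS with nothing but the FS shell; the paths are placeholders, it assumes append is enabled on the cluster, and it gives none of the delivery guarantees that Flume, NiFi or StreamSets provide:
tail -F /var/log/app/access.log | hadoop fs -appendToFile - /logs/access.log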

How to download Hadoop files (on HDFS) via FTP?

I would like to implement an SSIS job that is able to download large CSV files that are located on a remote Hadoop cluster. Of course, having just a regular FTP server on Hadoop system does not expose HDFS files since it uses the local filesystem.
I would like to know whether there is an FTP server implementation that sits on top of HDFS. I would prefer this approach rather than having to copy files from HDFS to the local FS and then having the FTP server serving this because I will need to allocate more storage space.
I forked an open-source project that works as expected: https://github.com/jamesattard/maroodi

Hdfs put VS webhdfs

I'm loading a 28 GB file into Hadoop HDFS using WebHDFS and it takes ~25 minutes to load.
I tried loading the same file using hdfs put and it took ~6 minutes. Why is there so much difference in performance?
What is recommended to use? Can somebody explain, or direct me to a good link? It would be really helpful.
Below is the command I'm using:
curl -i --negotiate -u: -X PUT "http://$hostname:$port/webhdfs/v1/$destination_file_location/$source_filename.temp?op=CREATE&overwrite=true"
This will redirect to a datanode address, which I use in the next step to write the data.
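(For reference, the second step of that two-step WebHDFS create would typically look like the sketch below; the datanode address comes from the Location header returned by the first call, and the variables are the same placeholders as above.)
curl -i --negotiate -u: -X PUT -T "$source_filename" "http://<datanode-from-location-header>:<port>/webhdfs/v1/$destination_file_location/$source_filename.temp?op=CREATE&overwrite=true"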
Hadoop provides several ways of accessing HDFS. All of the following support almost all features of the filesystem:
1. FileSystem (FS) shell commands: Provide easy access to Hadoop file system operations, as well as to other file systems that Hadoop supports, such as the local FS, HFTP FS and S3 FS. This needs the Hadoop client to be installed, and the client writes blocks directly to a DataNode. Not all versions of Hadoop support all options for copying between filesystems.
2. WebHDFS: Defines a public HTTP REST API which permits clients to access Hadoop from multiple languages without installing Hadoop, the advantage being that it is language agnostic (curl, PHP, etc.). WebHDFS needs access to all nodes of the cluster, and when data is read it is transmitted from the source node directly, but there is an HTTP overhead compared to (1) the FS shell. On the other hand, it works agnostically, with no problems across different Hadoop clusters and versions.
3. HttpFS: Reads and writes data to HDFS in a cluster behind a firewall. A single node acts as a gateway node through which all the data is transferred; performance-wise I believe this can be even slower, but it is preferred when you need to pull data from a public source into a secured cluster.
So choose rightly! Going down the list, each option is an alternative when the choice above it is not available to you.
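To make the options concrete, the same read looks roughly like this through each route (hostnames, ports and paths are placeholders; the WebHDFS port is the NameNode HTTP port, commonly 50070 on Hadoop 2 or 9870 on Hadoop 3, and 14000 is the usual HttpFS default):
hadoop fs -cat /data/file.csv                                                            # (1) FS shell, direct RPC to the cluster
curl -L "http://<namenode>:50070/webhdfs/v1/data/file.csv?op=OPEN"                       # (2) WebHDFS, redirects you to a datanode
curl -L "http://<httpfs-host>:14000/webhdfs/v1/data/file.csv?op=OPEN&user.name=<user>"   # (3) HttpFS, all data flows through the gateway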
Hadoop provides a FileSystem Shell API to support file system operations such as create, rename or delete files and directories, open, read or write file.
The FileSystem shell is a Java application that uses the Java FileSystem class to provide FileSystem operations. The FileSystem Shell API creates an RPC connection for the operations.
If the client is within the Hadoop cluster this is useful, because it uses the hdfs URI scheme to connect to the Hadoop distributed FileSystem, and hence the client makes a direct RPC connection to write data into HDFS.
This is good for applications running within the Hadoop cluster but there may be use cases where an external application needs to manipulate HDFS like it needs to create directories and write files to that directory or read the content of a file stored on HDFS. Hortonworks developed an API to support these requirements based on standard REST functionality called WebHDFS.
WebHDFS provides the REST API functionality through which any external application can connect to the DistributedFileSystem over an HTTP connection, no matter whether the external application is written in Java or PHP.
The WebHDFS concept is based on HTTP operations like GET, PUT, POST and DELETE.
Operations like OPEN, GETFILESTATUS and LISTSTATUS use HTTP GET; others like CREATE, MKDIRS, RENAME and SETPERMISSIONS rely on HTTP PUT.
It provides secure read-write access to HDFS over HTTP. It is basically intended as a replacement for HFTP (read-only access over HTTP) and HSFTP (read-only access over HTTPS). It uses the webhdfs URI scheme to connect to the distributed file system.
If the client is outside the Hadoop cluster and trying to access HDFS, WebHDFS is useful for it. Also, if you are trying to connect two different versions of Hadoop clusters, WebHDFS is useful, since it uses a REST API and is therefore independent of the MapReduce or HDFS version.
The difference between HDFS access and WebHDFS is scalability, due to the design of HDFS and the fact that an HDFS client decomposes a file into splits living on different nodes. When an HDFS client accesses file content, under the covers it goes to the NameNode and gets a list of file splits and their physical locations on the Hadoop cluster.
It can then go to the DataNodes at all those locations to fetch the blocks in the splits in parallel, piping the content directly to the client.
WebHDFS is a proxy living in the HDFS cluster and it layers on HDFS, so all data needs to be streamed to the proxy before it gets relayed on to the WebHDFS client. In essence it becomes a single point of access and an IO bottleneck.
You can use the traditional Java client API (which is used internally by the HDFS command-line tools).
From what I have read here,
the performance of the Java client and the REST-based approach is similar.

Spark stream unable to read files created from flume in hdfs

I have created a real-time application in which I am writing data streams to HDFS from weblogs using Flume, and then processing that data using Spark Streaming. But while Flume is writing and creating new files in HDFS, Spark Streaming is unable to process those files. If I put the files into the HDFS directory using the put command, Spark Streaming is able to read and process them. Any help regarding this would be great.
You have detected the problem yourself: while the stream of data continues, the HDFS file is "locked" and cannot be read by any other process. On the contrary, as you have experienced, if you put a batch of data (that's your file, a batch, not a stream), it is ready to be read once the upload is complete.
Anyway, not being an expert on Spark Streaming, it seems from the Spark Streaming Programming Guide, Overview section, that you are not performing the right deployment. I mean, from the picture shown there, it seems the stream (in this case generated by Flume) must be sent directly to the Spark Streaming engine; the results are then put in HDFS.
Nevertheless, if you want to keep your deployment, i.e. Flume -> HDFS -> Spark, then my suggestion is to create mini-batches of data in temporary HDFS folders, and once a mini-batch is ready, to store new data in a second mini-batch while passing the first one to Spark for analysis.
HTH
In addition to frb's answer, which is correct: Spark Streaming with Flume acts as an Avro RPC server, so you'll need to configure an AvroSink which points to your Spark Streaming instance.
With Spark 2 you can now connect your Spark Streaming job directly to Flume (see the official docs), and then write to HDFS once at the end of the process.
import org.apache.spark.streaming.flume._
val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])
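For completeness, a sketch of the Flume side of this push-based setup; the sink and channel names are placeholders, and the hostname/port must match the ones passed to FlumeUtils.createStream above:
# Flume agent configuration (names and addresses are placeholders)
agent.sinks.sparkSink.type = avro
agent.sinks.sparkSink.hostname = <chosen machine's hostname>
agent.sinks.sparkSink.port = <chosen port>
agent.sinks.sparkSink.channel = memoryChannel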
