Import live data from a server to HDFS instantly? - hadoop

Is it possible to load a live feed from a server into HDFS? I have to load live feed data, which arrives via a server, into HDFS instantly, without any loss of time.

There are lots of technologies that consume data in real time (or near real time) and have write connectors to HDFS.
Flume
NiFi
StreamSets
are the ones I have used.
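For example, assuming a Flume agent is already running with an Avro source on a known host/port and an HDFS sink behind it, the live feed can be pushed to it through Flume's RPC client SDK. A minimal sketch; the host, port and payload below are placeholders:
import java.nio.charset.StandardCharsets
import org.apache.flume.api.RpcClientFactory
import org.apache.flume.event.EventBuilder

object LiveFeedToFlume {
  def main(args: Array[String]): Unit = {
    // Assumes a Flume agent with an Avro source listening on this host/port
    // and an HDFS sink configured to land the events in HDFS.
    val client = RpcClientFactory.getDefaultInstance("flume-agent-host", 41414)
    try {
      // Each record from the live feed becomes one Flume event.
      val event = EventBuilder.withBody("one record from the live feed", StandardCharsets.UTF_8)
      client.append(event) // delivered to the agent, which writes it to HDFS
    } finally {
      client.close()
    }
  }
}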

Related

Loading Batch Offline Data to DWH environment with Kafka as the "Entering door"

Some context for my question.
As you can see here:
https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c
There are 2 "doors" to load data into HDFS
Sqoop
Kafka
Using this topology as an example, what would be the best practice for loading batch offline data, which is hosted on an FTP server, into HDFS?
Let's also assume that no changes need to be performed on the file; we need to store it in HDFS in the same structure as it is stored on the FTP server.
Thoughts?
Kafka isn't really configured to transfer "file-sized" data by default. At least, not entire files in one message. You could break the file into lines, but then you need to reorder them and put the file back together in HDFS.
In my experience, I've seen a few options from an FTP server.
Vanilla Hadoop, no extra software required
Use an NFS Gateway, WebHDFS or HttpFS to copy files directly to HDFS as if it were another filesystem
Additional Software required
Your own code with an FTP client and an HDFS client connection (see the sketch at the end of this answer)
Spark Streaming w/ an FTP Connector and HDFS write output
Kafka & Kafka Connect with an FTP Connector source and HDFS Sink
A Flume agent running on the FTP Server with an HDFS sink
Apache NiFi with a GetFTP and PutHDFS processor
StreamSets Data Collector doing something similar to NiFi (I don't know the exact terms for this one)
"we need to store it in HDFS in the same structure it is stored in the FTP server."
If these are small files, you're better off at least compressing them into a Hadoop-supported archive format before uploading them to HDFS.
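To illustrate the "your own code" option above, here is a rough sketch that streams a file from the FTP server straight into HDFS without landing it on a local disk first. It assumes Apache Commons Net on the FTP side and the standard Hadoop FileSystem API; the host, credentials and paths are placeholders:
import org.apache.commons.net.ftp.{FTP, FTPClient}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object FtpToHdfs {
  def main(args: Array[String]): Unit = {
    val ftp = new FTPClient()
    ftp.connect("ftp.example.com")          // placeholder FTP host
    ftp.login("user", "password")           // placeholder credentials
    ftp.enterLocalPassiveMode()
    ftp.setFileType(FTP.BINARY_FILE_TYPE)

    val conf = new Configuration()          // picks up core-site.xml / hdfs-site.xml
    val fs = FileSystem.get(conf)

    // Open the remote file as a stream and pipe it straight into an HDFS file,
    // keeping the same relative path as on the FTP server.
    val in = ftp.retrieveFileStream("/data/export/file1.csv")
    val out = fs.create(new Path("/data/export/file1.csv"))
    IOUtils.copyBytes(in, out, conf, true)  // true = close both streams when done

    ftp.completePendingCommand()            // finalize the FTP transfer
    ftp.logout()
    ftp.disconnect()
  }
}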

Different ways to import files into HDFS

I want to know the different ways in which I can bring data into HDFS.
I am a newbie to Hadoop and have been a Java web developer until now. If I have a web application that creates log files, how can I import those log files into HDFS?
There are lots of ways to ingest data into HDFS; let me try to illustrate them here:
hdfs dfs -put - a simple way to copy files from the local file system into HDFS
HDFS Java API - programmatic writes (a short sketch follows this list)
Sqoop - for bringing data to/from databases
Flume - for streaming files and logs
Kafka - a distributed queue, mostly for near-real-time stream processing
NiFi - an Apache project for moving data into HDFS without having to make lots of changes
The best solution for bringing web application logs into HDFS is Flume.
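As mentioned above, here is a minimal sketch of the HDFS API option, written in Scala against the standard Hadoop FileSystem class. It does the equivalent of hdfs dfs -put; the paths are placeholders:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object PutLogsIntoHdfs {
  def main(args: Array[String]): Unit = {
    // Configuration is read from core-site.xml / hdfs-site.xml on the classpath.
    val conf = new Configuration()
    val fs = FileSystem.get(conf)

    // Equivalent of: hdfs dfs -put /var/log/webapp/app.log /logs/webapp/app.log
    fs.copyFromLocalFile(
      new Path("/var/log/webapp/app.log"),   // local source (placeholder)
      new Path("/logs/webapp/app.log")       // HDFS destination (placeholder)
    )
    fs.close()
  }
}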
We have three different kinds of data: structured (schema-based systems like Oracle/MySQL etc.), unstructured (images, weblogs etc.) and semi-structured data (XML etc.).
Structured data can be stored in a SQL database, in tables with rows and columns.
Semi-structured data is information that doesn't reside in a relational database but does have some organizational properties that make it easier to analyze. With some processing you can store it in a relational database (e.g. XML).
Unstructured data often includes text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents.
Depending on type of your data, you will choose the tools to import data into HDFS.
Your company may use CRM or ERP tools, but we don't know exactly how that data is organized and structured.
If we leave aside simple HDFS commands like put, copyFromLocal etc. for loading data into an HDFS-compatible format, below are the main tools for loading data into HDFS:
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Data from MySQL, SQL Server & Oracle tables can be loaded into HDFS with this tool.
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.
Other tools include Chukwa, Storm and Kafka.
Another important technology, which is becoming very popular, is Spark. It is both a friend and a foe to Hadoop.
Spark is emerging as a good alternative to Hadoop for real-time data processing, and it may or may not use HDFS as its data source.
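To make that last point concrete, here is a minimal Spark sketch that reads weblogs from HDFS and writes a filtered result back; the paths and the filter condition are made up for the example:
import org.apache.spark.sql.SparkSession

object LogFilterJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("log-filter").getOrCreate()

    // Read raw weblogs from HDFS, keep only the error lines, write the result back.
    val logs = spark.read.textFile("hdfs:///logs/webapp/")       // placeholder input path
    val errors = logs.filter(line => line.contains("ERROR"))
    errors.write.text("hdfs:///logs/webapp-errors/")             // placeholder output path

    spark.stop()
  }
}
The same job could just as easily read from a non-HDFS source (local files, S3, Kafka) by changing the input, which is why Spark may or may not use HDFS.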

Data Ingestion Into HDFS by unique technique

I want to transfer unstructured/semi-structured data (MS Word/PDF/JSON) from a remote computer into Hadoop (it could be in batch or near real time, but not streaming).
I have to make sure that the data is moved quickly from the remote location (over a low-bandwidth connection) into HDFS or onto my local machine.
For example, Internet Download Manager has this amazing technique of opening several connections to the FTP server and making better use of low bandwidth with more connections.
Does the Hadoop ecosystem provide any such tool to ingest data into Hadoop, or is there a self-made technique?
Which tool/technique would be better?
You could use the WebHDFS REST API: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Document_Conventions
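One hedged way to use it from code: Hadoop's FileSystem API understands webhdfs:// URIs, so the upload goes over HTTP only. The host, port and paths below are placeholders:
import java.io.FileInputStream
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object WebHdfsUpload {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // webhdfs:// talks to the NameNode's HTTP port (50070 on Hadoop 2.x, 9870 on 3.x),
    // so the remote machine only needs HTTP access to the cluster.
    val fs = FileSystem.get(new URI("webhdfs://namenode-host:50070"), conf)

    // Stream a local document (Word/PDF/JSON, it is all just bytes) into HDFS.
    val in = new FileInputStream("/tmp/report.pdf")              // local file (placeholder)
    val out = fs.create(new Path("/ingest/docs/report.pdf"))     // HDFS destination (placeholder)
    IOUtils.copyBytes(in, out, conf, true)                       // true = close both streams

    fs.close()
  }
}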

Spark stream unable to read files created from flume in hdfs

I have created a real-time application in which I write data streams to HDFS from weblogs using Flume, and then process that data using Spark Streaming. But while Flume is writing and creating new files in HDFS, Spark Streaming is unable to process those files. If I put the files into the HDFS directory using the put command, Spark Streaming is able to read and process them. Any help regarding this would be great.
You have detected the problem yourself: while the stream of data continues, the HDFS file is "locked" and cannot be read by any other process. On the contrary, as you have experienced, if you put a batch of data (that's your file: a batch, not a stream), once it is uploaded it is ready to be read.
Anyway, not being an expert on Spark Streaming, it seems from the Spark Streaming Programming Guide (Overview section) that you are not using the intended deployment. I mean, from the picture shown there, it seems the stream (in this case generated by Flume) should be sent directly to the Spark Streaming engine; the results would then be put into HDFS.
Nevertheless, if you want to keep your deployment, i.e. Flume -> HDFS -> Spark, then my suggestion is to create mini-batches of data in temporary HDFS folders and, once a mini-batch is ready, start storing new data in a second mini-batch while passing the first one to Spark for analysis.
HTH
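To make that suggestion concrete, here is a hedged sketch of the mini-batch approach: Spark Streaming watches a "ready" directory, and files are only moved (renamed) into it once they are complete, since textFileStream only picks up files that appear atomically in the monitored directory. The directory name and batch interval are placeholders:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsDirectoryStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hdfs-minibatch-stream")
    val ssc = new StreamingContext(conf, Seconds(60))      // placeholder batch interval

    // Flume (or a helper job) writes into a temporary folder and, once a file is
    // closed, renames it into /logs/ready. Spark only ever sees complete files.
    val lines = ssc.textFileStream("hdfs:///logs/ready")    // placeholder directory
    lines.count().print()                                   // replace with real processing

    ssc.start()
    ssc.awaitTermination()
  }
}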
In addition to frb's answer, which is correct: Spark Streaming with Flume acts as an Avro RPC server, so you'll need to configure an Avro sink that points to your Spark Streaming instance.
With Spark 2 you can now connect your Spark Streaming job directly to Flume (see the official docs) and then write to HDFS once, at the end of the process.
import org.apache.spark.streaming.flume._
val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])
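Continuing that snippet, here is a rough sketch of the "write once on HDFS at the end" step, assuming the events carry UTF-8 text; it decodes each Flume event and saves every batch to HDFS, and the output prefix is a placeholder:
// Rough continuation of the snippet above: decode each event body and save each batch to HDFS.
val lines = flumeStream.map(e => new String(e.event.getBody.array(), "UTF-8"))
lines.saveAsTextFiles("hdfs:///flume/out/batch")   // placeholder output prefix
streamingContext.start()
streamingContext.awaitTermination()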

Hadoop - streaming data from HTTP upload (PUT) into HDFS directly

I have the following application deployment:
a web front-end taking data from clients through HTTP/FTP
a Hadoop cluster
I need to store the clients' data in HDFS. What is the best way of doing that? Is it possible to stream the data to HDFS directly, without first consuming all of the data from the client onto a local drive and then putting it into HDFS?
The feasible options I can think of right now are:
HttpFS
WebHDFS
FTP client over HDFS
HDFS over WebDAV
Choosing the "best" one is totally up to you, based on your convenience and ease of use.
Personally, if you want low-latency access to HDFS, your best bet is HBase. You can put and get values very easily since it is just a key-value store. We are using the same thing in our application(s) and it works fabulously.
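For illustration, a small sketch of what that key-value access looks like with the HBase client API; the table name, column family and row key are made up for the example:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseLowLatency {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()   // reads hbase-site.xml from the classpath
    val connection = ConnectionFactory.createConnection(conf)
    // Placeholder table "uploads" with a single column family "d".
    val table = connection.getTable(TableName.valueOf("uploads"))

    // Write: one row per client upload, keyed by client id + timestamp.
    val put = new Put(Bytes.toBytes("client42-20240101T120000"))
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("...bytes..."))
    table.put(put)

    // Read it back with low latency.
    val result = table.get(new Get(Bytes.toBytes("client42-20240101T120000")))
    val payload = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"))
    println(Bytes.toString(payload))

    table.close()
    connection.close()
  }
}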
