How to implement Apache Storm to monitor an HDFS directory - hadoop

I have an HDFS directory where files will be copied continuously (streamed) from many sources.
How do I build a topology to monitor that HDFS directory, i.e. whenever a new file is created in the directory it should be processed?

You are looking to monitor HDFS file/directory changes.
Take a look at this question, which points to existing support in Oozie and HBase:
How to know that a new data is been added to HDFS?
You can send items into your topology for processing when new files are detected by these tools.
Or you can write your own custom logic in Storm that periodically lists the HDFS directory and checks whether new files have been added. Check out tick tuple support in Storm.
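For reference, here is a minimal sketch of that polling approach, assuming Storm 1.x and the Hadoop client on the classpath; the class name, output field and poll interval are illustrative, not something from the original answer:

import java.util.{Map => JMap}
import scala.collection.mutable
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.storm.spout.SpoutOutputCollector
import org.apache.storm.task.TopologyContext
import org.apache.storm.topology.OutputFieldsDeclarer
import org.apache.storm.topology.base.BaseRichSpout
import org.apache.storm.tuple.{Fields, Values}

// Polls an HDFS directory and emits the path of every file it has not seen before.
class HdfsDirSpout(dir: String, pollMs: Long = 10000L) extends BaseRichSpout {
  @transient private var collector: SpoutOutputCollector = _
  @transient private var fs: FileSystem = _
  private val seen = mutable.Set[String]()   // grows unbounded; persist/prune it in a real topology

  override def open(conf: JMap[_, _], ctx: TopologyContext,
                    coll: SpoutOutputCollector): Unit = {
    collector = coll
    fs = FileSystem.get(new Configuration())  // picks up core-site.xml / hdfs-site.xml
  }

  override def nextTuple(): Unit = {
    fs.listStatus(new Path(dir))
      .filter(_.isFile)
      .map(_.getPath.toString)
      .filterNot(seen.contains)
      .foreach { p =>
        seen += p
        collector.emit(new Values(p))         // downstream bolts open and process the file
      }
    Thread.sleep(pollMs)                      // crude poll interval; a tick-tuple bolt is the alternative
  }

  override def declareOutputFields(d: OutputFieldsDeclarer): Unit =
    d.declare(new Fields("hdfsPath"))
}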

Related

how to load text files into hdfs through oozie workflow in a cluster

I am trying to load text/CSV files into Hive scripts with Oozie and schedule this on a daily basis. The text files are on the local Unix file system.
I need to put those text files into HDFS before executing the Hive scripts in an Oozie workflow.
In a real cluster we don't know which node the job will run on; it can run on any node in the cluster.
Can anyone provide me with a solution?
Thanks in advance.
Not sure I understand what you want to do.
The way I see it, it can't work:
Oozie server has access to HDFS files only (same as Hive)
your data is on a local filesystem somewhere
So why don't you load your files into HDFS beforehand? The transfer may be triggered either when the files are available (a post-processing action in the upstream job) or at a fixed time (using Linux cron).
You don't even need the Hadoop libraries on the Linux box if the WebHDFS service is active on your NameNode - just use curl and an HTTP upload.
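To illustrate that last point, here is a rough sketch of the WebHDFS two-step upload using only the JDK's HttpURLConnection (no Hadoop client on the box); the NameNode host, the Hadoop 2 default port 50070, the paths and the user are all placeholders:

import java.net.{HttpURLConnection, URL}
import java.nio.file.{Files, Paths}

object WebHdfsUpload {
  def put(localFile: String, hdfsPath: String, nameNodeHost: String, user: String): Unit = {
    // Step 1: ask the NameNode where to write; it answers with a 307 redirect to a DataNode.
    val createUrl = new URL(
      s"http://$nameNodeHost:50070/webhdfs/v1$hdfsPath?op=CREATE&user.name=$user&overwrite=true")
    val nn = createUrl.openConnection().asInstanceOf[HttpURLConnection]
    nn.setRequestMethod("PUT")
    nn.setInstanceFollowRedirects(false)       // we want the Location header, not the redirect itself
    val dataNodeUrl = nn.getHeaderField("Location")
    nn.disconnect()

    // Step 2: PUT the file content to the DataNode URL from the Location header.
    val dn = new URL(dataNodeUrl).openConnection().asInstanceOf[HttpURLConnection]
    dn.setRequestMethod("PUT")
    dn.setDoOutput(true)
    val out = dn.getOutputStream
    out.write(Files.readAllBytes(Paths.get(localFile)))
    out.close()
    require(dn.getResponseCode == 201, s"upload failed: HTTP ${dn.getResponseCode}")
    dn.disconnect()
  }
}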

Spark stream unable to read files created from flume in hdfs

I have created a real-time application in which I am writing data streams to HDFS from weblogs using Flume, and then processing that data using Spark Streaming. But while Flume is writing and creating new files in HDFS, Spark Streaming is unable to process those files. If I put the files into the HDFS directory using the put command, Spark Streaming is able to read and process them. Any help regarding this would be great.
You have detected the problem yourself: while the stream of data continues, the HDFS file is "locked" and cannot be read by any other process. On the contrary, as you have experienced, if you put a batch of data (that's your file, a batch, not a stream), once it is uploaded it is ready to be read.
Anyway, not being an expert on Spark Streaming, it seems from the Spark Streaming Programming Guide, Overview section, that you are not performing the right deployment. I mean, from the picture shown there, it seems the stream (in this case generated by Flume) must be sent directly to the Spark Streaming engine; then the results are put in HDFS.
Nevertheless, if you want to maintain your deployment, i.e. Flume -> HDFS -> Spark, then my suggestion is to create mini-batches of data in temporal HDFS folders, and once a mini-batch is ready, store new data in a second mini-batch while passing the first one to Spark for analysis.
HTH
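A rough sketch of that mini-batch workaround, assuming Flume writes into a staging directory and completed files are renamed into a watched directory (renames are atomic in HDFS), so textFileStream only ever sees finished files; the directory names are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WatchCompletedBatches {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("WatchCompletedBatches"), Seconds(60))
    // Only files that appear in this directory after the job starts are picked up,
    // so have Flume (or a small mover job) rename completed files into it.
    val lines = ssc.textFileStream("hdfs:///weblogs/ready")
    lines.count().print()   // replace with the real analysis
    ssc.start()
    ssc.awaitTermination()
  }
}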
In addition to frb's answer, which is correct: Spark Streaming with Flume acts as an Avro RPC server - you'll need to configure an Avro sink which points to your Spark Streaming instance.
With Spark 2 you can now connect your Spark Streaming job directly to Flume (see the official docs) and then write to HDFS once at the end of the process.
import org.apache.spark.streaming.flume._
val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])
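For completeness, a fuller sketch of that flow (push-based Flume receiver -> processing -> one HDFS write per micro-batch), assuming the spark-streaming-flume artifact is on the classpath; the receiver host, port and output path are placeholders:

import java.nio.charset.StandardCharsets
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumeToHdfs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("FlumeToHdfs"), Seconds(30))
    // Flume's Avro sink must be configured to send to this host and port.
    val flumeStream = FlumeUtils.createStream(ssc, "spark-receiver-host", 9988)
    flumeStream
      .map { e =>                               // decode each Flume event body as UTF-8 text
        val buf = e.event.getBody
        val bytes = new Array[Byte](buf.remaining())
        buf.get(bytes)
        new String(bytes, StandardCharsets.UTF_8)
      }
      .foreachRDD { (rdd, time) =>
        if (!rdd.isEmpty())                     // one HDFS write per micro-batch
          rdd.saveAsTextFile(s"hdfs:///weblogs/processed/batch-${time.milliseconds}")
      }
    ssc.start()
    ssc.awaitTermination()
  }
}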

How can I use Oozie to copy remote files into HDFS?

I have to copy remote files into HDFS. I want to use Oozie because I need to run this job every day at a specific time.
Oozie can help you create a workflow. Using Oozie you can invoke an external action capable of copying files from your source to HDFS, but Oozie will not do it automatically.
Here are a few suggestions:
Use a custom program to write files to HDFS, for example using a SequenceFile.Writer (see the sketch after this list).
Flume might help.
Use an integration component like camel-hdfs to move files to HDFS.
FTP the files to an HDFS node and then copy them from local disk to HDFS.
Investigate more options that might be a good fit for your case.
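As a starting point for the first suggestion, a minimal sketch that appends local files into an HDFS SequenceFile with SequenceFile.Writer; the HDFS path is a placeholder, and it assumes hadoop-client (Hadoop 2.x) on the classpath:

import java.nio.file.{Files, Paths}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, SequenceFile, Text}

object LocalToSequenceFile {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()  // reads core-site.xml / hdfs-site.xml from the classpath
    val writer = SequenceFile.createWriter(
      conf,
      SequenceFile.Writer.file(new Path("hdfs:///ingest/remote-files.seq")),
      SequenceFile.Writer.keyClass(classOf[Text]),
      SequenceFile.Writer.valueClass(classOf[BytesWritable]))
    try {
      // key = original file name, value = raw bytes of the file
      for (local <- args) {
        val bytes = Files.readAllBytes(Paths.get(local))
        writer.append(new Text(local), new BytesWritable(bytes))
      }
    } finally {
      writer.close()
    }
  }
}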

stream data from flume to collect data from different directories

The logs from different network devices are getting uploaded into a directory structure /appdat/logs/device//devicename.gzip, so each device stores its logs in its respective ZIP-code directory. Can any existing Flume source be used to send newly uploaded files in any of the sub-directories to HDFS, or do I need to write a new custom source? The Cloudera version being used is CDH4.
There is a change proposed by Phil Scala that will do recursive checking. To my knowledge it hasn't been accepted yet.
The current actively developed version is Apache Flume - not the Cloudera version.

getting data in and out of hadoop

I need a system to analyze large log files. A friend directed me to Hadoop the other day and it seems perfect for my needs. My question revolves around getting data into Hadoop:
Is it possible to have the nodes on my cluster stream data into HDFS as they receive it? Or would each node need to write to a local temp file and submit the temp file after it reaches a certain size? And is it possible to append to a file in HDFS while also running queries/jobs on that same file at the same time?
The Fluentd log collector just released its WebHDFS plugin, which allows users to stream data into HDFS instantly. It's easy to install and manage.
Fluentd + Hadoop: Instant Big Data Collection
Of course you can import data directly from your applications. Here's a Java example of posting logs to Fluentd.
Fluentd: Data Import from Java Applications
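For a flavour of that, a small sketch (written in Scala here) using the fluent-logger-java library the article is about; the tag and record fields are placeholders, and the logger connects to a local Fluentd agent on the default port 24224:

import java.util.{HashMap => JHashMap}
import org.fluentd.logger.FluentLogger

object FluentdExample {
  // connects to the in_forward input of a Fluentd agent on localhost:24224
  private val log = FluentLogger.getLogger("app")

  def main(args: Array[String]): Unit = {
    val record = new JHashMap[String, AnyRef]()
    record.put("path", "/index.html")
    record.put("status", Integer.valueOf(200))
    log.log("access", record)   // emitted with tag "app.access"
    log.close()
  }
}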
A Hadoop job can run over multiple input files, so there's really no need to keep all your data as one file. You won't be able to process a file until its file handle is properly closed, however.
HDFS does not support appends (yet?)
What I do is run the map-reduce job periodically and output the results to a 'processed_logs_#{timestamp}' folder.
Another job can later take these processed logs and push them to a database etc., so they can be queried online.
I'd recommend using Flume to collect the log files from your servers into HDFS.
