How can I use Oozie to copy remote files into HDFS? - hadoop

I have to copy remote files into HDFS. I want to use Oozie because I need to run this job every day at a specific time.

Oozie can help you create a workflow. Using Oozie you can invoke an external action capable of copying files from your source to HDFS, but Oozie will not do it automatically.
Here are a few suggestions:
Use a custom program to write files to HDFS, for example using a SequenceFile.Writer (see the sketch after this list).
Flume might help.
Use an integration component like camel-hdfs to move files to HDFS.
FTP the files to an HDFS node and then copy them from the local disk into HDFS.
Investigate more options that might be a good fit for your case.
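To illustrate the first suggestion, here is a minimal sketch of a custom copier that reads one local file and appends it to a SequenceFile on HDFS. The NameNode address, the paths, and the choice of a Text key with a BytesWritable value are placeholders to adapt, not something mandated by any of the tools above.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LocalToHdfsSequenceFile {
    public static void main(String[] args) throws IOException {
        // Placeholder locations: a local file to ship and the target SequenceFile on HDFS.
        String localFile = "/var/log/app/app.log";
        Path target = new Path("hdfs://namenode:8020/data/incoming/app.seq");

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(target.toUri(), conf);

        // Store the file name as the key and the raw file bytes as the value.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, target, Text.class, BytesWritable.class);
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(localFile));
            writer.append(new Text(localFile), new BytesWritable(bytes));
        } finally {
            writer.close();
        }
    }
}

Scheduled from an Oozie workflow (or plain cron), something along these lines covers the "copy every day at a specific time" requirement.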

Related

how to load text files into hdfs through oozie workflow in a cluster

I am trying to load text/CSV files with Hive scripts through Oozie and schedule it on a daily basis. The text files are on a local Unix file system.
I need to put those text files into HDFS before executing the Hive scripts in an Oozie workflow.
In a real cluster we don't know which node the job will run on; it can run on any node in the cluster.
Can anyone provide a solution?
Thanks in advance.
Not sure I understand what you want to do.
The way I see it, it can't work:
Oozie server has access to HDFS files only (same as Hive)
your data is on a local filesystem somewhere
So why don't you load your files into HDFS beforehand? The transfer may be triggered either when the files are available (a post-processing action in the upstream job) or at a fixed time (using Linux cron).
You don't even need the Hadoop libraries on the Linux box if the WebHDFS service is active on your NameNode - just use curl and an HTTP upload.
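For illustration, here is a minimal sketch of that upload done from plain Java with no Hadoop libraries, using only the WebHDFS REST calls (curl works the same way). The NameNode host, the 50070 web port, the user name, and the paths are assumptions to adjust. A WebHDFS CREATE is a two-step exchange: the NameNode replies with a redirect to a DataNode, and the file content is then PUT to that second URL.

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsUpload {
    public static void main(String[] args) throws Exception {
        // Assumed cluster details: NameNode web port 50070 (Hadoop 1.x default) and a target path.
        String nameNode = "http://namenode:50070";
        String target = "/user/etl/incoming/data.csv?op=CREATE&user.name=etl&overwrite=true";
        String localFile = "/tmp/data.csv";

        // Step 1: ask the NameNode where to write; it answers with a 307 redirect to a DataNode.
        HttpURLConnection nn = (HttpURLConnection) new URL(nameNode + "/webhdfs/v1" + target).openConnection();
        nn.setRequestMethod("PUT");
        nn.setInstanceFollowRedirects(false);
        nn.connect();
        String dataNodeUrl = nn.getHeaderField("Location");
        nn.disconnect();

        // Step 2: stream the local file to the DataNode URL returned above.
        HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
        dn.setRequestMethod("PUT");
        dn.setDoOutput(true);
        dn.setRequestProperty("Content-Type", "application/octet-stream");
        OutputStream out = dn.getOutputStream();
        InputStream in = new FileInputStream(localFile);
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        in.close();
        out.close();
        System.out.println("DataNode response: " + dn.getResponseCode()); // 201 Created on success
    }
}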

How do I add files to distributed cache in an oozie job

I am implementing an Oozie workflow where, in the first job, I read data from a database using Sqoop and write it to HDFS. In the second job I need to read a large amount of data and use the files I just wrote in job one to process it. Here's what I thought of or tried:
Assuming job one writes the files to some directory on HDFS, adding the files to the distributed cache in the driver class of job two will not work, as the Oozie workflow only knows about the mapper and reducer classes of the job. (Please correct me if I am wrong here.)
I also tried writing to the lib directory of the workflow, hoping the files would then be added to the distributed cache automatically, but I understand that the lib directory should be read-only while the job is running.
I also thought that if I could add the files to the distributed cache in the setup() of job two, I could access them in the mapper/reducer. I am not aware of how one can add files in setup(); is it possible?
How else can I read the output files of the previous job in the subsequent job from the distributed cache? I am already using the input directory of job two to read the data that needs to be processed, so I cannot use that.
I am using Hadoop 1.2.1, Oozie 3.3.2 on Ubuntu 12.04 virtual machine.
Add the properties below to add files or archives to your map-reduce action. Refer to the documentation for details.
<file>[FILE-PATH]</file>
...
<archive>[FILE-PATH]</archive>
You can also pass arguments on the Java action's command line, as shown below.
<main-class>org.apache.oozie.MyFirstMainClass</main-class>
<java-opts>-Dblah</java-opts>
<arg>argument1</arg>
<arg>argument2</arg>
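On the setup() part of the question: you do not add files in setup(); the <file> element above does the adding, and setup() is where you read them. A rough sketch, assuming job one produced a tab-separated lookup file that was declared with <file> in the job-two action (the class name, field names, and file layout are made up for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // Files declared with <file> in the Oozie action are localized for each task
        // and are visible through the distributed cache.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached == null) {
            return;
        }
        for (Path p : cached) {
            BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumed layout: key<TAB>value per line.
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Enrich each input record with the cached lookup data.
        String[] parts = value.toString().split("\t", 2);
        String enriched = lookup.get(parts[0]);
        if (enriched != null) {
            context.write(new Text(parts[0]), new Text(enriched));
        }
    }
}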

How to implement Apache storm to monitor HDFS directory

I have an HDFS directory into which files are continuously copied (streamed) from many sources.
How do I build a topology to monitor the HDFS directory, i.e. whenever a new file is created in that directory it should be processed?
You are looking to monitor HDFS file/directory changes.
Take a look at this question, which points to existing support in Oozie and HBase:
How to know that a new data is been added to HDFS?
You can send items into your topology for processing when new files are detected by these tools.
Or you can write your own custom logic in Storm that periodically lists the HDFS directory and checks for new files. Check out tick tuple support in Storm.
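As a sketch of that second option, a spout could poll the directory with the HDFS FileSystem API and emit one tuple per unseen file. Everything below is an assumption to adapt: the watched path, the poll interval, the pre-Apache backtype.storm package names of that Storm generation, and the in-memory "seen" set, which is neither bounded nor durable. (Tick tuples are the equivalent trick if you put the check in a bolt instead.)

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class NewHdfsFileSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private FileSystem fs;
    private final Set<String> seen = new HashSet<String>();
    private final Path watched = new Path("hdfs://namenode:8020/data/incoming"); // assumed directory

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            this.fs = FileSystem.get(watched.toUri(), new Configuration());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        try {
            // List the directory and emit a tuple for every file we have not seen before.
            for (FileStatus status : fs.listStatus(watched)) {
                String path = status.getPath().toString();
                if (seen.add(path)) {
                    collector.emit(new Values(path));
                }
            }
        } catch (Exception e) {
            // A real spout should report the failure instead of swallowing it.
        }
        Utils.sleep(10000); // poll every 10 seconds
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hdfsPath"));
    }
}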

How to trigger Oozie jobs on particular condition?

I have a folder where all my application log files get stored. When a new log file is created in the folder, Oozie should immediately trigger a Flume job which puts the log file into HDFS.
How do I trigger an Oozie job when a new log file is created in the folder?
Any help on this topic is greatly appreciated!
That's not how Oozie works. Oozie is a scheduler, a bit like cron: first you specify how often a workflow should run, and then you can add the availability of the files as an additional condition.
I think it's more a question of how you place the files in HDFS. You could always have a parameterized Oozie job, invoked through the Oozie Java API, passing in the name of the file created on HDFS from the client that writes to HDFS itself (unless you are streaming).
Every time an Oozie workflow is initiated it runs on a separate thread, so you can launch multiple Oozie instances with different parameters.
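A minimal sketch of that approach with the Oozie client API, assuming a parameterized workflow is already deployed on HDFS; the Oozie URL, the application path, the cluster addresses, and the inputFile property name are placeholders:

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.OozieClientException;

public class TriggerWorkflow {
    public static void main(String[] args) throws OozieClientException {
        // Assumed Oozie server URL and workflow application path.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/etl/apps/flume-ingest");
        // Typical workflow parameters; adjust to whatever your workflow definition expects.
        conf.setProperty("jobTracker", "jobtracker-host:8021");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        // Pass the newly created log file as a workflow parameter, referenced as ${inputFile}.
        conf.setProperty("inputFile", args[0]);

        String jobId = oozie.run(conf);
        System.out.println("Started workflow " + jobId);
    }
}

The client that detects the new log file would call this with the file's path, and the workflow would use ${inputFile} in its Flume or shell action.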

How to Backup and Restore HDFS

I have developed an application which uses HDFS to store images. Now I want to migrate servers and set up Hadoop again on the new server. How can I back up my image files from HDFS on the old server to HDFS on the new server?
I tried using the copyToLocal command to back up and copyFromLocal to restore, but I get an error: when the application is running, the images I restored to HDFS do not show up in my application.
How to solve this ?
Thanks
DistCp is the command to use for large inter- and intra-cluster copies (something like hadoop distcp hdfs://old-namenode/images hdfs://new-namenode/images, adjusted to your NameNode addresses and paths). Here is the documentation.
copyToLocal and copyFromLocal should also work well for small amounts of data. Use the HDFS CLI to confirm the files are actually there after the restore; if they are, it is probably a problem in the application.
