How do I add files to distributed cache in an oozie job - hadoop

I am implementing an Oozie workflow where, in the first job, I read data from a database using Sqoop and write it to HDFS. In the second job I need to read a large amount of data and use the files I just wrote in job one to process it. Here's what I thought of or tried:
Assuming job one writes the files to some directory on HDFS, adding the files to the distributed cache in the driver class of job two will not work, as the Oozie workflow only knows about the mapper and reducer classes of the job. (Please correct me if I am wrong here.)
I also tried writing to the lib directory of the workflow, hoping the files would then be added to the distributed cache automatically, but I understood that the lib directory should be read-only while the job is running.
I also thought that if I could add the files to the distributed cache in the setup() of job two, then I could access them in the mapper/reducer. I am not aware of how one can add files in setup(); is it possible?
How else can I read the output files of the previous job in the subsequent job from the distributed cache? I am already using the input directory of job two to read the data that needs to be processed, so I cannot use that.
I am using Hadoop 1.2.1 and Oozie 3.3.2 on an Ubuntu 12.04 virtual machine.

Add the elements below to your map-reduce action to add files or archives to the distributed cache. Refer to the Oozie documentation for details.
<file>[FILE-PATH]</file>
...
<archive>[FILE-PATH]</archive>
You can also pass input on the Java action's command line as shown below.
<main-class>org.apache.oozie.MyFirstMainClass</main-class>
<java-opts>-Dblah</java-opts>
<arg>argument1</arg>
<arg>argument2</arg>
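To read such a file in job two: anything listed with a <file> element (you can append #name to choose the link name, e.g. <file>/path/to/job-one-output/part-00000#lookup.txt</file>) is symlinked into the task's working directory, so the mapper can open it by that local name in setup() without touching the driver class. A minimal sketch, assuming a hypothetical tab-separated lookup file linked as lookup.txt:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JobTwoMapper extends Mapper<LongWritable, Text, Text, Text> {

    // in-memory copy of the file written by job one
    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // "lookup.txt" is the link name from the <file> element; Oozie/Hadoop
        // places it in the task's working directory via the distributed cache.
        BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // use the cached lookup data while processing the large input
        String enriched = lookup.get(value.toString());
        if (enriched != null) {
            context.write(value, new Text(enriched));
        }
    }
}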

Related

how to load text files into hdfs through oozie workflow in a cluster

I am trying to load text/CSV files in Hive scripts with Oozie and schedule it on a daily basis. The text files are on the local Unix file system.
I need to put those text files into HDFS before executing the Hive scripts in an Oozie workflow.
In a real cluster we don't know which node the job will run on; it can run on any node in the cluster.
Can anyone provide a solution?
Thanks in advance.
Not sure I understand what you want to do.
The way I see it, it can't work:
Oozie server has access to HDFS files only (same as Hive)
your data is on a local filesystem somewhere
So why don't you load your files into HDFS beforehand? The transfer can be triggered either when the files become available (as a post-processing action in the upstream job) or at a fixed time (using Linux cron).
You don't even need the Hadoop libraries on the Linux box if the WebHDFS service is active on your NameNode - just use curl and an HTTP upload.
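For reference, the same WebHDFS upload can also be driven from plain Java with HttpURLConnection. This is a rough sketch of the two-step CREATE call (the NameNode host, user name and paths are placeholders):

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsUpload {

    public static void main(String[] args) throws Exception {
        String nameNode = "http://namenode.example.com:50070"; // default NameNode HTTP port in Hadoop 1.x
        String target = "/data/incoming/logs.csv";
        String localFile = "/tmp/logs.csv";

        // Step 1: ask the NameNode where to write; it answers with a 307 redirect.
        URL createUrl = new URL(nameNode + "/webhdfs/v1" + target
                + "?op=CREATE&user.name=hdfs&overwrite=true");
        HttpURLConnection nnConn = (HttpURLConnection) createUrl.openConnection();
        nnConn.setRequestMethod("PUT");
        nnConn.setInstanceFollowRedirects(false);
        String dataNodeUrl = nnConn.getHeaderField("Location");
        nnConn.disconnect();

        // Step 2: stream the local file to the DataNode URL returned above.
        HttpURLConnection dnConn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
        dnConn.setRequestMethod("PUT");
        dnConn.setDoOutput(true);
        OutputStream out = dnConn.getOutputStream();
        InputStream in = new FileInputStream(localFile);
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        in.close();
        out.close();
        System.out.println("WebHDFS response: " + dnConn.getResponseCode()); // expect 201 Created
        dnConn.disconnect();
    }
}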

How to trigger Oozie jobs on particular condition?

I have a folder where all my application log files get stored. If a new log file is created in the folder, Oozie should immediately trigger a Flume job which will put the log file into HDFS.
How do I trigger an Oozie job when a new log file is created in the folder?
Any help on this topic is greatly appreciated !!!
That's not how Oozie works. Oozie is a scheduler, a bit like cron: first you specify how often a workflow should run, and then you can add file availability as an additional condition.
I think it's more a question of how you place the files in HDFS. You could always have a parameterized Oozie job, invoked through the Oozie Java API, passing in the name of the newly created file from the client that writes it to HDFS (unless it is streaming).
Every time an Oozie workflow is initiated it runs on a separate thread, which allows you to start multiple workflow instances with different parameters.
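A minimal sketch of such an Oozie Java API call, using org.apache.oozie.client.OozieClient (the Oozie URL, application path and the inputFile property are hypothetical and would have to match your workflow definition):

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class TriggerWorkflow {

    public static void main(String[] args) throws Exception {
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/apps/flume-workflow");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "jobtracker:8021");
        // pass the newly created log file as a workflow parameter
        conf.setProperty("inputFile", "/logs/app/app-2014-01-01.log");

        String jobId = client.run(conf);
        System.out.println("Started workflow " + jobId);
    }
}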

Using Hadoop Cluster Remotely

I have a web application and one or more remote clusters. These clusters can be on different machines.
I want to perform the following operations from my web application:
1 HDFS Actions :-
Create New Directory
Remove files from HDFS(Hadoop Distributed File System)
List Files present on HDFS
Load File onto the HDFS
Unload File
2 Job Related Actions:-
Submit Map Reduce Jobs
View their status, i.e. how much of the job has completed
Time taken by the job to finish
I need a tool that can help me do these tasks from the web application - via an API, REST calls, etc. I'm assuming that the tool will be running on the same machine as the web application and can point to a particular remote cluster.
As a last option (since there can be multiple, disparate clusters, it would be difficult to ensure that each of them has the plug-in, library, etc. installed), I'm wondering whether there is some Hadoop library or plug-in that sits on the cluster, allows access from remote machines and performs the mentioned tasks.
The best framework that allows everything you have listed here is Spring for Apache Hadoop. It has Java and scripting API based implementations to do the following:
1 HDFS Actions :-
Create New Directory
Remove files from HDFS(Hadoop Distributed File System)
List Files present on HDFS
Load File onto the HDFS
Unload File
as well as Spring scheduling based implementations to do the following:
2 Job Related Actions:-
Submit Map Reduce Jobs
View their status, i.e. how much of the job has completed
Time taken by the job to finish
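If pulling in Spring is not an option, the HDFS part of the list can also be covered by the stock Hadoop FileSystem API pointed at a remote NameNode. A rough sketch (the NameNode URI and all paths are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteHdfsActions {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://remote-namenode:8020"), conf);

        // Create a new directory
        fs.mkdirs(new Path("/user/webapp/new-dir"));

        // Remove files from HDFS
        fs.delete(new Path("/user/webapp/old-file.txt"), false);

        // List files present on HDFS
        for (FileStatus status : fs.listStatus(new Path("/user/webapp"))) {
            System.out.println(status.getPath());
        }

        // Load a file onto HDFS
        fs.copyFromLocalFile(new Path("/tmp/local.csv"), new Path("/user/webapp/local.csv"));

        // Unload (download) a file from HDFS
        fs.copyToLocalFile(new Path("/user/webapp/local.csv"), new Path("/tmp/downloaded.csv"));

        fs.close();
    }
}

Job submission and progress tracking would still go through the JobClient/Job APIs or a framework like Spring for Apache Hadoop on top.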

Run a Hadoop job without output file

Is it possible to run a Hadoop job without specifying an output file?
When I try to run a Hadoop job, a "no output file specified" exception is thrown.
Can anyone please describe how to do this using Java?
I am writing the data processed by reduce to a non-relational database, so I no longer need the job to write to HDFS.
Unfortunately, you can't really do this. Writing output is part of the framework. When you work outside of the framework, you basically have to just deal with the consequences.
You can use NullOutputFormat, which doesn't write any data to HDFS. I think it still creates the folder, though. You could always let Hadoop create the folder, then delete it.
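A driver sketch along those lines, assuming a hypothetical job whose reducer writes to the database itself; with NullOutputFormat no output path has to be set at all:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class NoHdfsOutputDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "reduce-to-database"); // Hadoop 1.x style constructor

        job.setJarByClass(NoHdfsOutputDriver.class);
        // job.setMapperClass(...);   // your mapper
        // job.setReducerClass(...);  // your reducer that writes to the non-relational DB

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Discard framework output entirely; nothing is written to HDFS
        // and no output directory needs to be specified.
        job.setOutputFormatClass(NullOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}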

How do I set up a distributed map-reduce job using hadoop streaming and ruby mappers/reducers?

I'm able to run a local mapper and reducer built using ruby with an input file.
I'm unclear about the behavior of the distributed system though.
For the production system, I have HDFS set up across two machines. I know that if I store a large file on HDFS, it will have some blocks on both machines to allow for parallelization. Do I also need to store the actual mapper and reducer files (my Ruby files in this case) on HDFS as well?
Also, how would I then go about actually running the streaming job so that it runs in a parallel manner on both systems?
If you use mappers/reducers written in Ruby (or anything other than Java), you have to use Hadoop Streaming. Hadoop Streaming has an option (-file) to package your mapper/reducer scripts and ship them with the job when it is sent to the cluster, so they do not need to live on HDFS. The following link should have what you are looking for.
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html
