Using Hadoop Cluster Remotely

I have a web application and one or more remote Hadoop clusters, which can be on different machines.
I want to perform the following operations from my web application:
1. HDFS actions:
Create a new directory
Remove files from HDFS (Hadoop Distributed File System)
List files present on HDFS
Load (upload) a file onto HDFS
Unload (download) a file from HDFS
2. Job-related actions:
Submit MapReduce jobs
View their status, i.e. how much of the job has completed
Time taken by the job to finish
I need a tool that can help me do these tasks from the web application, via an API, REST calls, etc. I'm assuming that the tool will be running on the same machine (as the web application) and can point to a particular remote cluster.
As a last option (since there can be multiple, disparate clusters, it would be difficult to ensure that each of them has the plug-in, library, etc. installed), I'm wondering whether there is some Hadoop library or plug-in that sits on the cluster, allows access from remote machines, and performs the mentioned tasks.

The framework that best supports everything you have listed here is Spring for Apache Hadoop (part of the Spring Data family). It has Java Scripting API based support for the following (a short sketch using the underlying HDFS API is included after the lists below):
1. HDFS actions:
Create a new directory
Remove files from HDFS (Hadoop Distributed File System)
List files present on HDFS
Load (upload) a file onto HDFS
Unload (download) a file from HDFS
as well as Spring Scheduling based support for the following:
2. Job-related actions:
Submit MapReduce jobs
View their status, i.e. how much of the job has completed
Time taken by the job to finish
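For reference, Spring for Apache Hadoop wraps the standard org.apache.hadoop.fs.FileSystem API, so the HDFS actions above can also be driven directly from Java. A minimal sketch, assuming the remote NameNode is reachable at hdfs://namenode-host:8020 (host and paths are placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteHdfsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "namenode-host:8020" is a placeholder; point it at the target cluster's NameNode.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

        // Create a new directory
        fs.mkdirs(new Path("/data/incoming"));

        // Load (upload) a local file onto HDFS
        fs.copyFromLocalFile(new Path("/tmp/report.csv"), new Path("/data/incoming/report.csv"));

        // List files present on HDFS
        for (FileStatus status : fs.listStatus(new Path("/data/incoming"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }

        // Unload (download) a file from HDFS
        fs.copyToLocalFile(new Path("/data/incoming/report.csv"), new Path("/tmp/report-copy.csv"));

        // Remove a file (second argument: recursive delete, needed only for directories)
        fs.delete(new Path("/data/incoming/report.csv"), false);

        fs.close();
    }
}

The same FileSystem handle can be pointed at a different cluster simply by changing the URI, which matches the requirement of targeting one of several remote clusters from the web application.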

Related

Find and set Hadoop logs to verbose level

I need to track what is happening when I run a job or upload a file to HDFS. I do this using SQL Profiler in SQL Server. However, I am missing such a tool for Hadoop, so I am assuming that I can get some information from the logs. I think all logs are stored at /var/logs/hadoop/, but I am confused about which file I need to look at and how to set that file to capture detailed (verbose) information.
I am using HDP 2.2.
Thanks,
Sree
'Hadoop' represents an entire ecosystem of different products. Each one has its own logging.
HDFS consists of the NameNode and DataNode services. Each has its own log. The location of the logs is distribution-dependent. See "File Locations" for Hortonworks or "Apache Hadoop Log Files: Where to find them in CDH, and what info they contain" for Cloudera.
In Hadoop 2.2, MapReduce ('jobs') is a specific application on YARN, so you are talking about the ResourceManager and NodeManager services (the YARN components), each with its own log, and then there is the MR application (the M/R component), which is a YARN application with yet another log of its own.
Jobs consist of tasks, and tasks themselves have their own logs.
In Hadoop 2 there is a dedicated JobHistory service tasked with collecting and storing the logs from the jobs executed.
Higher-level components (e.g. Hive, Pig, Kafka) have their own logs, aside from the logs produced by the jobs they submit (which log just like any other job).
The good news is that vendor-specific distributions (Cloudera, Hortonworks, etc.) provide UIs that expose the most common logs for easy access. Usually they expose the logs collected by the JobHistory service from the UI that shows job status and job history.
I cannot point you to anything equivalent to SQL Profiler, because the problem space is orders of magnitude more complex, with many different products, versions, and vendor-specific distributions involved. I recommend starting by reading about and learning how the JobHistory server runs and how it can be accessed.
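For example, in Hadoop 2 the JobHistory server exposes a REST API (default web port 19888) that lists finished MapReduce jobs with their status and timings. A minimal sketch of querying it from Java, with "historyserver-host" as a placeholder for the actual host:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class JobHistoryQuery {
    public static void main(String[] args) throws Exception {
        // "historyserver-host" is a placeholder; 19888 is the default JobHistory web port.
        URL url = new URL("http://historyserver-host:19888/ws/v1/history/mapreduce/jobs");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                // JSON listing of finished jobs: ids, user, state, start/finish times, etc.
                System.out.println(line);
            }
        } finally {
            conn.disconnect();
        }
    }
}

From there, per-job and per-task details are available under the same /ws/v1/history/mapreduce/jobs/{jobid} path.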

how to load text files into hdfs through oozie workflow in a cluster

I am trying to load text/CSV files in Hive scripts with Oozie and schedule it on a daily basis. The text files are on the local Unix file system.
I need to put those text files into HDFS before executing the Hive scripts in an Oozie workflow.
In a real cluster we don't know which node the job will run on; it will run on any one of the nodes in the cluster.
Can anyone provide me with a solution?
Thanks in advance.
Not sure I understand what you want to do.
The way I see it, it can't work:
The Oozie server has access to HDFS files only (same as Hive)
Your data is on a local filesystem somewhere
So why don't you load your files into HDFS beforehand? The transfer may be triggered either when the files are available (a post-processing action in the upstream job) or at a fixed time (using Linux cron).
You don't even need the Hadoop libraries on the Linux box if the WebHDFS service is active on your NameNode: just use cURL and an HTTP upload.
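For illustration, here is a minimal sketch of such a WebHDFS upload done from Java instead of cURL, using the documented two-step CREATE protocol (ask the NameNode, then PUT the bytes to the DataNode it redirects to). Host names, the port (50070 is the usual NameNode HTTP port), user name, and paths are placeholders:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WebHdfsUpload {
    public static void main(String[] args) throws Exception {
        // Placeholders: namenode-host, user "etl", local and HDFS paths.
        String createUrl = "http://namenode-host:50070/webhdfs/v1/data/incoming/daily.csv"
                + "?op=CREATE&user.name=etl&overwrite=true";

        // Step 1: ask the NameNode where to write; it answers with a 307 redirect to a DataNode.
        HttpURLConnection nn = (HttpURLConnection) new URL(createUrl).openConnection();
        nn.setRequestMethod("PUT");
        nn.setInstanceFollowRedirects(false);
        String dataNodeUrl = nn.getHeaderField("Location");
        nn.disconnect();

        // Step 2: PUT the file content to the DataNode URL returned above.
        HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
        dn.setRequestMethod("PUT");
        dn.setDoOutput(true);
        try (InputStream in = Files.newInputStream(Paths.get("/var/data/daily.csv"));
             OutputStream out = dn.getOutputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        System.out.println("HTTP status: " + dn.getResponseCode()); // 201 Created on success
        dn.disconnect();
    }
}

These are the same two calls the classic curl recipe makes; scheduling this class (or the curl equivalent) from cron keeps the Hadoop client libraries off the edge node entirely.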

How do I add files to distributed cache in an oozie job

I am implementing an Oozie workflow where, in the first job, I read data from a database using Sqoop and write it to HDFS. In the second job I need to read a large amount of data and use the files I just wrote in job one to process it. Here's what I thought of or tried:
Assuming job one writes the files to some directory on HDFS, adding the files to the distributed cache in the driver class of job two will not work, as the Oozie workflow only knows about the mapper and reducer classes of the job. (Please correct me if I am wrong here.)
I also tried writing to the lib directory of the workflow, hoping that the files would then be automatically added to the distributed cache, but I understood that the lib directory should be read-only while the job is running.
I also thought that if I could add the files to the distributed cache in the setup() of job two, then I could access them in the mapper/reducer. I am not aware of how one can add files in setup(); is it possible?
How else can I read the output files of the previous job in the subsequent job from the distributed cache? I am already using the input directory of job two to read the data that needs to be processed, so I cannot use that.
I am using Hadoop 1.2.1 and Oozie 3.3.2 on an Ubuntu 12.04 virtual machine.
Add the elements below to your map-reduce action to add files or archives to its distributed cache. Refer to this documentation for details.
<file>[FILE-PATH]</file>
...
<archive>[FILE-PATH]</archive>
You can also pass arguments to a java action on the command line, as shown below.
<main-class>org.apache.oozie.MyFirstMainClass</main-class>
<java-opts>-Dblah</java-opts>
<arg>argument1</arg>
<arg>argument2</arg>
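To tie this back to the question: once a file is declared with a <file> element, Hadoop symlinks it into the task's working directory under the name given after the '#' (or under its own name if no '#' is used), so the mapper can simply open it by name in setup(). A minimal sketch, assuming a hypothetical entry such as <file>${outputDir}/lookup.txt#lookup.txt</file> where ${outputDir} is the directory written by job one:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // "lookup.txt" is the symlink name declared after '#' in the <file> element.
        BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Join each input record against the cached lookup data (illustrative logic).
        String[] fields = value.toString().split("\t", 2);
        String enriched = lookup.containsKey(fields[0]) ? lookup.get(fields[0]) : "UNKNOWN";
        context.write(new Text(fields[0]), new Text(enriched));
    }
}

Since the <file> path can be parameterized with workflow EL (for example, the output directory of the first action), there is no need to touch the lib directory or to add anything to the cache programmatically.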

Is it possible for Hadoop MapReduce programs to access local resource?

Could the Hadoop framework (or runtime) prevent (or constrain) an application's MapReduce program from accessing local resources such as the local filesystem?
I guess the answer should be yes, especially when the MapReduce program is running on a cluster.
A secure (Kerberized) cluster will run containers under the user that submitted the job. Ordinary access control can then isolate this user's access to local resources.
Non-secure clusters run containers as the NodeManager user (I'm talking about a modern YARN cluster, not a 1.x version).
The most recent Hadoop version (2.6, very soon to be released) contains YARN-1964, which allows Docker-based containers. They are fully isolated (Docker), but this was committed to 2.6 on 2014-11-12, so it is about two weeks mature. You'll be living on the edge.
Of course, MapReduce will use local resources in the map/reduce phases.
The output of the map phase is stored on the local file system, and then the shuffle and sort happen.
Next, the data is fed into the reduce phase.
You can specify the local path used to store the intermediate map results with the mapred.local.dir property in Hadoop v1.
In Hadoop v2, from the docs:
Property: mapreduce.cluster.local.dir
Value: ${hadoop.tmp.dir}/mapred/local
Description: The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk I/O. Directories that do not exist are ignored.
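If you want to check what this resolves to on a given node, here is a minimal sketch that reads the effective value through the standard Configuration API (it picks up whatever *-site.xml files are on the classpath):

import org.apache.hadoop.conf.Configuration;

public class LocalDirCheck {
    public static void main(String[] args) {
        // Loads core-site.xml / mapred-site.xml from the classpath, if present.
        Configuration conf = new Configuration();
        // Hadoop v1 name: mapred.local.dir; Hadoop v2 name: mapreduce.cluster.local.dir.
        String localDirs = conf.get("mapreduce.cluster.local.dir",
                conf.get("mapred.local.dir", "${hadoop.tmp.dir}/mapred/local"));
        System.out.println("Intermediate map output is spilled under: " + localDirs);
    }
}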
Hope it helps!

Hadoop with Hive

We want to develop a simple Java EE web application for log file analysis using Hadoop. The following is the approach we are planning to follow, but we are unable to get through it.
Log files would be uploaded to the Hadoop server from client machines using SFTP/FTP.
Call a Hadoop job to fetch the log file and process it into the HDFS file system.
While processing the log file, the content will be stored into a Hive database.
Search the log content using a Hive JDBC connection from the client web application.
We browsed many samples to fulfill some of the steps, but we could not find any concrete sample application.
Please suggest whether the above approach is correct, and point us to a sample application developed in Java.
I would point out a few things:
a) You need to merge log files, or in some other way take care that you do not have too many of them. Consider Flume (http://flume.apache.org/), which is built to accept logs from various sources and put them into HDFS.
b) If you go with FTP, you will need some scripting to take data from the FTP server and put it into HDFS.
c) The main problem I see is running a Hive job as the result of a client's web request. A Hive query is not interactive: it will take at least dozens of seconds, and probably much more.
I would also be wary of concurrent requests: you probably cannot run more than a few in parallel.
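For the search step itself (point 4 of the question), the usual route is the HiveServer2 JDBC driver. A minimal sketch, with the host, credentials, and the table/column names (web_logs, ts, level, message) as placeholders; as noted above, such a query can take tens of seconds, so run it from a background worker rather than directly in the request thread:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveLogSearch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port 10000, and user are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver-host:10000/default", "hive", "");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "SELECT ts, level, message FROM web_logs WHERE level = 'ERROR' LIMIT 100");
        while (rs.next()) {
            System.out.println(rs.getString("ts") + " " + rs.getString("message"));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}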
In my opinion, you could do the following:
1) Instead of accepting logs from various sources and putting them into HDFS, you can put them into one database, say SQL Server, and from there import your data into Hive (or HDFS) using Sqoop.
2) This will reduce the effort of writing the various jobs needed to bring the data into HDFS.
3) Once the data is in Hive, you can do whatever you want.
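As an illustration of step 1, the Sqoop import could be kicked off from Java by simply invoking the sqoop command line (the JDBC URL, credentials, and table names below are placeholders for the suggested SQL Server staging database and the Hive target table):

import java.util.Arrays;
import java.util.List;

public class SqoopImportLauncher {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; replace with the real SQL Server instance and tables.
        List<String> cmd = Arrays.asList(
                "sqoop", "import",
                "--connect", "jdbc:sqlserver://dbhost:1433;databaseName=logs",
                "--username", "loader", "--password-file", "/user/loader/.sqoop.pwd",
                "--table", "raw_logs",
                "--hive-import", "--hive-table", "web_logs");

        // Run the sqoop CLI and stream its console output to this process.
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        int exitCode = p.waitFor();
        System.out.println("sqoop import finished with exit code " + exitCode);
    }
}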
