Customers able to upload urls in any time to database and application should processes urls as soon as possible. So i need periodic hadoop jobs running or run hadoop job automatically from other application(any script identifies new links were added, generates data for hadoop job and runs job). For PHP or Python script, i could set up cronjob, but what is best practice for periodic hadoop jobs running (prepare data for hadoop, upload data, run hadoop job and move data back to database?
Take a look at Oozie, the new workflow system from Y!, which can run jobs based on different triggers. A good overflow is presented by Alejandro here: http://www.slideshare.net/ydn/5-oozie-hadoopsummit2010
If you want urls to be processed as soon as possible, you'll have them processed each at a time. My recommendation is to wait for some number of links (or MB of links, or for example 10 min, every day).
And batch process them (I do my processing daily, but that jobs takes few hours)
Related
Following 1 and 2:
Different types of files enter my NFS directory from time to time. I would like to use OOZIE or any other HDFS solution to trigger the file arrival event and to copy the file into specific location at the HDFS in accordance to its type. What is the best way to do it?
Best way is very subjective term. It largely depends on, what kind of data, frequency and what sorts of things should happen once the data arrive at specific location.
Apache flume can monitor specific folder for data availability and push it down to any sink like HDFS as-is. Flume is good for streaming data.But it does only one specific job- just moving data from place to place.
But on other hand, look up Oozie Coordinators. Coordinators have data availability trigger and with oozie you can perform all sort of ETL operations after data arrives using tools like spark,hive,pig etc and push it down to hdfs using shell actions. You can schedule jobs to run during specific times,frequency or have job send you an email if something goes wrong...
I understand the concepts of HDFS and Map Reduce and how it is important to move the processing logic to the data to increase efficiency. I was even able to run a couple of map reduce job on my basic Hadoop cluster. Surrounding these concepts there are a lot of different technologies like YARN, HUE, OOZIE all of which seems to do the same thing (at least from a very high level) which is operation visibility and CRUD abilities for jobs (which can be map-reduce or something else).
Am I correct in making this assumption or is there a much more fundamental difference between them?
Thanks
Kay
YARN - Map Reduce is API where you have to implement data processing logic in it. Once the code is compiled you have to submit the jobs using hadoop jar command. YARN is the framework which will keep track of the resources, submit the job on the cluster, execute the job, show/log the progress.
OOZIE - Take a data integration example. You might have to get a data set from one database and other data set from other database, then you want to join, process the data and reload it into a cache or 3rd database. It involves 2 sqoop jobs to pull data from database, a hive/map reduce job to join and process the data, then push into cache/database. All these jobs are dependent on each other, eg: we are supposed to process the data only after data is pulled from source databases. Hence we need to create a workflow to execute complete data integration process. OOZIE can facilitate that. It is map reduce based workflow tool. Workflow it self will be executed as one or more map reduce jobs.
HUE: There are many tools in Hadoop - HDFS (file system), Sqoop, Hive/pig to process the data, Impala, HBase and many many more. To execute the POCs, it can get tedious to connect to the cluster. Also it need some linux skills. To overcome those challenges all the Hadoop eco system tools are consolidate under one umbrella - called Hue.
i am using hadoop to process bigdata, i first load data to hdfs and then execute jobs, but it is sequential. Is it possible to do it in parallel. For example,
running 3 jobs and 2 process of load data from others jobs at same time on my cluster.
Cheers
It is possible to run the all job's in parallel in hadoop if your cluster and jobs satisfies the below criteria:
1) Hadoop Cluster should have capability to run reasonable number of map/reduce task(depends on jobs) in parallel(i.e. should have enough map/reduce slots).
2) If jobs that is currently being run , depends on the data which is loaded through another process, we cannot run data load and job in parallel.
If you process satisfies the above condition, you can all the jobs in parallel.
Using Oozie you can schedule all the process to run in parallel. Fork and Join properties in Oozie allows you to accomplish the task to run in parallel.
If your cluster has enough resources to run the jobs in parallel, then yes. But be sure that the work of each job, doesn't interfere with the others. Like load the data at the same time that another job in execution should be using it, that won't work as you expected.
If there is not enough resources, then hadoop will enqueue the jobs until the resources are available, depending on the Scheduler configured.
Lets say I am resource constrained in my Hadoop environment and I don't want to schedule really long running jobs (ie takes days to complete). I am analyzing vast amount of past time series data. I want to schedule mapreduce jobs that take a day's worth of data at a time (which takes an hour to crunch).
So how do I schedule such that new job is submitted as soon as previous job is completed?
If you want a quick and simple approach you could just write a shell script that calls hadoop jar in sequence for each job you want to run.
If you want a more robust approach you could use Apache Oozie to define a workflow of jobs that will run your jobs in sequence. If you are new to Hadoop you may find it easiest to define and run your Oozie workflow using the Hue GUI.
I ran a pig work-flow using oozie. The job completed successfully but now I want to know on which slave or master the job ran. My input file is a 1.4GB file which is distributed on the nodes (1 master and 2 slaves).
And I also want to figure out how much time did the pig executed on each node.
Thank you in advance
Point your web browser to "JobTracker_Machine:50030" and it will presetn you the MapReduce webUI. Here you'll find all the jobs you have run(Running, Completed and Retired). Click on the job which you want to analyze and it will give you all the information you need including the node where a particular task has run and the time taken to finish the task.
HTH
Go to the Oozie Web console and click on the workflow (which contains the pig node).Clicking on the worklfow job will lead to a dialog box (for your workflow) containing details of all the action nodes in the workflow. Select the pig node (which you want to analyse) and a detailed dialog box will appear containing the Job Tracker's URL of that pig job.
There you will find all the details you are looking for.