Load and process data in parallel inside Hadoop - hadoop

I am using Hadoop to process big data. I first load data into HDFS and then execute jobs, but this is sequential. Is it possible to do it in parallel? For example,
running 3 jobs and 2 data-load processes from other jobs at the same time on my cluster.
Cheers

It is possible to run all the jobs in parallel in Hadoop if your cluster and jobs satisfy the criteria below:
1) The Hadoop cluster should be able to run a reasonable number of map/reduce tasks in parallel (i.e. it should have enough map/reduce slots), depending on the jobs.
2) If a job depends on data that is loaded through another process, the data load and that job cannot run in parallel.
If your processes satisfy the above conditions, you can run all the jobs in parallel.
Using Oozie you can schedule all the processes to run in parallel; the fork and join nodes in an Oozie workflow let you run tasks concurrently.

If your cluster has enough resources to run the jobs in parallel, then yes. But make sure the work of each job doesn't interfere with the others: loading data at the same time that another running job is reading it, for example, won't work as you expect.
If there are not enough resources, Hadoop will queue the jobs until resources become available, depending on the configured scheduler.
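As a minimal shell-level sketch of running jobs and data loads concurrently (the jar names, class names and HDFS paths below are placeholders), you can simply launch each command in the background and wait for all of them:

#!/bin/bash
# Launch two MapReduce jobs in the background so they run at the same time,
# provided the cluster has enough free map/reduce slots (or YARN containers).
hadoop jar job1.jar com.example.Job1 /data/input1 /data/output1 &
hadoop jar job2.jar com.example.Job2 /data/input2 /data/output2 &

# Load a third data set into HDFS in parallel; this only conflicts with the
# jobs above if one of them reads the directory being written to.
hadoop fs -put /local/newdata /data/input3 &

# Block until every background command has finished.
wait
echo "All parallel jobs and loads finished"

For production scheduling, the Oozie fork/join approach mentioned above is the more robust option.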

Related

What is the difference between HUE, YARN and OOZIE

I understand the concepts of HDFS and MapReduce and how important it is to move the processing logic to the data to increase efficiency. I was even able to run a couple of MapReduce jobs on my basic Hadoop cluster. Surrounding these concepts there are a lot of different technologies like YARN, HUE and OOZIE, all of which seem to do the same thing (at least from a very high level), which is operation visibility and CRUD abilities for jobs (which can be MapReduce or something else).
Am I correct in making this assumption, or is there a much more fundamental difference between them?
Thanks
Kay
YARN - MapReduce is the API in which you implement your data processing logic. Once the code is compiled, you submit the job using the hadoop jar command. YARN is the framework that keeps track of resources, submits the job to the cluster, executes it, and shows/logs its progress.
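As a small illustration (the jar, class, paths and application id below are placeholders), the submission and the YARN-side view look like this:

# Submit a compiled MapReduce job; YARN allocates containers, runs the tasks
# and reports progress.
hadoop jar wordcount.jar com.example.WordCount /user/kay/input /user/kay/output

# Watch the application from the YARN side.
yarn application -list                # currently running/accepted applications
yarn logs -applicationId <app_id>     # aggregated logs, if log aggregation is enabled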
OOZIE - Take a data integration example. You might have to get one data set from one database and another data set from a second database, then join and process the data and load it into a cache or a third database. That involves two Sqoop jobs to pull the data from the databases, a Hive/MapReduce job to join and process the data, and a step to push the result into the cache/database. All these jobs depend on each other, e.g. we are supposed to process the data only after it has been pulled from the source databases. Hence we need a workflow to execute the complete data integration process; OOZIE facilitates that. It is a MapReduce-based workflow tool, and the workflow itself is executed as one or more MapReduce jobs.
HUE - There are many tools in the Hadoop ecosystem: HDFS (the file system), Sqoop, Hive/Pig to process the data, Impala, HBase and many more. To run even a proof of concept it can get tedious to connect to the cluster, and it requires some Linux skills. To overcome those hurdles, the Hadoop ecosystem tools are consolidated under one umbrella, called Hue.

Hadoop schedule jobs to run sequentially (one job after other)?

Let's say I am resource-constrained in my Hadoop environment and I don't want to schedule really long-running jobs (i.e. ones that take days to complete). I am analyzing a vast amount of past time-series data, and I want to schedule MapReduce jobs that each take a day's worth of data at a time (which takes an hour to crunch).
So how do I schedule things so that a new job is submitted as soon as the previous job has completed?
If you want a quick and simple approach you could just write a shell script that calls hadoop jar in sequence for each job you want to run.
If you want a more robust approach you could use Apache Oozie to define a workflow of jobs that will run your jobs in sequence. If you are new to Hadoop you may find it easiest to define and run your Oozie workflow using the Hue GUI.
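A minimal sketch of the shell-script approach (the jar, class and paths are placeholders): each hadoop jar call blocks until its job finishes, so the next job is only submitted once the previous one has completed.

#!/bin/bash
set -e   # stop the chain if any job fails

# Process one day's worth of data per job, strictly in sequence.
for day in 2015-01-01 2015-01-02 2015-01-03; do
    # hadoop jar does not return until the MapReduce job has finished,
    # so this loop naturally runs the jobs one after another.
    hadoop jar analysis.jar com.example.DailyAnalysis "/data/$day" "/out/$day"
done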

How to run Hue Hive Queries sequentially

I have set up Cloudera Hue and have a cluster with a master node (200 GiB disk, 16 GiB RAM) and 3 datanodes (150 GiB disk, 8 GiB RAM each).
I have a database of roughly 70 GiB. The problem is when I try to run Hive queries from the Hive editor (Hue GUI): if I submit 5 or 6 queries for execution, the jobs are started but they hang and never run. How can I run the queries sequentially? I mean, even though I can submit several queries, a new query should only start when the previous one has completed. Is there any way to make the queries run one by one?
You can submit all your queries in one go by separating them with ';' in Hue.
For example:
Query1;
Query2;
Query3
In this case query1, query2 and query3 will run sequentially, one after the other.
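If you are working outside of Hue, a minimal sketch of the same idea from the Hive CLI (queries.hql and the table names are placeholders) is to put the semicolon-separated statements in one file; the CLI likewise executes them one after another:

# Placeholder queries written to a script file.
cat > queries.hql <<'EOF'
SELECT COUNT(*) FROM table1;
SELECT COUNT(*) FROM table2;
SELECT COUNT(*) FROM table3;
EOF

# hive -f runs the statements in the file sequentially.
hive -f queries.hql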
Hue submits all the queries; if they hang, it means you are probably hitting a misconfiguration in YARN, like gotcha #5 in http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
So the entire flow of YARN/MR2 is as follows:
1) A query is submitted from the Hue Hive query editor.
2) A job is started, and the ResourceManager starts an ApplicationMaster on one of the datanodes.
3) The ApplicationMaster asks the ResourceManager for resources (e.g. 2 x 1 GiB / 1 core).
4) The ResourceManager grants these resources (containers on the NodeManagers, which then run the map and reduce tasks) to the ApplicationMaster.
So resource allocation is handled by YARN. In the case of a Cloudera cluster, Dynamic Resource Pools (a kind of queue) are where jobs are submitted, and YARN then allocates resources to those jobs. By default, the maximum number of concurrent jobs is set in such a way that the ResourceManager hands all the resources to the jobs' ApplicationMasters, leaving no room for the task containers (which the ApplicationMasters need later to actually run the map and reduce tasks).
http://www.cloudera.com/content/cloudera/en/resources/library/recordedwebinar/introduction-to-yarn-and-mapreduce-2-slides.html
So if we submit a large number of queries in the Hue Hive editor, they are all submitted as concurrent jobs, their ApplicationMasters consume the available resources, no space is left for task containers, and all the jobs stay in a pending state.
The solution is as mentioned above by @Romain:
Set the maximum number of concurrent jobs according to the size and capacity of your cluster; in my case a value of 4 worked.
Now only 4 jobs run concurrently from the pool, and the ResourceManager allocates resources to them.
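To check that this is what is happening (assuming a YARN/MR2 cluster with the yarn CLI available), you can compare the applications that are actually executing with those still waiting for resources:

# Applications whose ApplicationMaster got resources and are executing.
yarn application -list -appStates RUNNING

# Applications queued in the resource pool, still waiting for containers.
yarn application -list -appStates ACCEPTED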

Run hive queries, and collect job information

I would like to run a list of generated HIVE queries.
For each, I would like to retrieve the MR job_id (or ids, in case of multiple stages).
And then, with those job_ids, collect statistics from the JobTracker (cumulative CPU, bytes read, ...).
How can I send Hive queries from a bash or Python script, and retrieve the job_id(s)?
For the 2nd part (collecting stats for the job), we're using an MRv1 Hadoop cluster, so I don't have the ApplicationMaster REST API. I'm about to collect the data from the JobTracker web UI. Any better idea?
You can get the list of executed jobs by running this command:
hadoop job -list all
Then, for each job id, you can retrieve the stats with:
hadoop job -status job-id
To associate a job with a query, you can get the job name and match it against the query, something like this:
How to get names of the currently running hadoop jobs?
Hope this helps.
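One minimal bash sketch of the first part (the query and table name are placeholders): the Hive CLI prints a "Starting Job = job_..." line on stderr for every MapReduce stage it launches, so the job ids can be scraped from that output and fed to hadoop job -status:

#!/bin/bash
# Run the query and keep Hive's console output (progress goes to stderr).
hive -e "SELECT COUNT(*) FROM my_table;" 2> hive_run.log

# Pull out the job id of every stage that was launched.
JOB_IDS=$(grep -o 'job_[0-9_]*' hive_run.log | sort -u)

# Ask the JobTracker for the status and counters of each job.
for job_id in ${JOB_IDS}; do
    hadoop job -status "${job_id}"
done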

Periodic hadoop jobs running (best practice)

Customers are able to upload URLs to the database at any time, and the application should process those URLs as soon as possible. So I need Hadoop jobs to run periodically, or to be triggered automatically from another application (a script that detects that new links were added, generates the data for the Hadoop job and runs the job). For a PHP or Python script I could set up a cron job, but what is the best practice for running periodic Hadoop jobs (prepare the data for Hadoop, upload the data, run the Hadoop job and move the data back to the database)?
Take a look at Oozie, the new workflow system from Y!, which can run jobs based on different triggers. A good overview is presented by Alejandro here: http://www.slideshare.net/ydn/5-oozie-hadoopsummit2010
If you want URLs to be processed as soon as possible, you'll end up processing them one at a time. My recommendation is to wait for some number of links (or some MB of links, or a fixed interval, e.g. 10 minutes or once a day) and batch process them (I do my processing daily, but that job takes a few hours).
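A rough sketch of the cron-driven batch approach (all names below, including the MySQL database, table, jar, class and paths, are illustrative placeholders): a script exports the newly added URLs, uploads them to HDFS, runs the job and pulls the result back, and cron runs the script once a day.

#!/bin/bash
# process_urls.sh -- export new URLs, run the Hadoop job, fetch the result.
set -e

# 1) Export the URLs added since the last run (placeholder MySQL schema).
mysql -u app -papp_pw urls_db \
    -e "SELECT url FROM urls WHERE processed = 0" > /tmp/new_urls.txt

# Nothing new? Then there is nothing to submit.
[ -s /tmp/new_urls.txt ] || exit 0

# 2) Upload the batch to HDFS under a unique name and run the MapReduce job.
BATCH="urls_$(date +%Y%m%d_%H%M%S)"
hadoop fs -put /tmp/new_urls.txt "/input/urls/${BATCH}.txt"
hadoop jar url-processor.jar com.example.UrlProcessor "/input/urls/${BATCH}.txt" "/output/${BATCH}"

# 3) Merge the job output back to the local disk; loading it into the
#    database is left as a placeholder step.
hadoop fs -getmerge "/output/${BATCH}" /tmp/processed_urls.txt

A matching crontab entry to run the batch once a day at 02:00:

0 2 * * * /opt/jobs/process_urls.sh >> /var/log/process_urls.log 2>&1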
