How to run Hue Hive Queries sequentially - hadoop

I have set up Cloudera Hue and have a cluster with a master node of 200 GiB storage and 16 GiB RAM, and 3 datanodes each with 150 GiB storage and 8 GiB RAM.
I have a database of approximately 70 GiB. The problem appears when I run Hive queries from the Hive editor (Hue GUI): if I submit 5 to 6 queries for execution, the jobs are started but they hang and never run. How can I run the queries sequentially? That is, even though I can submit several queries, a new query should only start when the previous one has completed. Is there any way to make the queries run one by one?

You can run all your queries in one go by separating them with ';' in Hue.
For example:
Query1;
Query2;
Query3
In this case, Query1, Query2 and Query3 will run sequentially, one after another.

Hue submits all the queries; if they hang, it means that you are probably hitting a misconfiguration in YARN, like gotcha #5: http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/

So the entire flow of YARN/MR2 is as follows:
1) A query is submitted from the Hue Hive query editor.
2) A job is started and the ResourceManager starts an ApplicationMaster on one of the datanodes.
3) This ApplicationMaster asks the ResourceManager for resources (e.g. 2 x 1 GiB / 1 core).
4) The ResourceManager grants these resources as containers on the NodeManagers, which then run the map and reduce tasks for the ApplicationMaster.
So resource allocation is handled by YARN. In a Cloudera cluster, Dynamic Resource Pools (a kind of queue) are where jobs are submitted, and YARN then allocates resources to those jobs. By default, the maximum number of concurrent jobs is set in such a way that the ResourceManager allocates all the resources to the jobs/ApplicationMasters, leaving no room for the task containers that the ApplicationMasters need later to actually run the map and reduce tasks.
http://www.cloudera.com/content/cloudera/en/resources/library/recordedwebinar/introduction-to-yarn-and-mapreduce-2-slides.html
So if we submit a large number of queries in the Hue Hive editor for execution, they will be submitted as jobs concurrently, the ApplicationMasters for them will be allocated resources leaving no room for task containers, and thus all the jobs will stay in a pending state.
The solution is as mentioned above by Romain:
set the maximum number of concurrent jobs according to the size and capacity of the cluster. In my case a value of 4 worked.
Now only 4 jobs will run concurrently from the pool, and the ResourceManager will allocate resources to them.
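To see why the cap matters, here is a rough back-of-the-envelope sketch in Python. The container sizes are assumptions loosely based on the 3 x 8 GiB datanodes from the question, not measured values:

# Rough illustration of why too many concurrent jobs starve task containers.
# Assumed numbers: 3 datanodes with 8 GiB usable for YARN each, 1 GiB per
# ApplicationMaster, 1 GiB per map/reduce task container.
cluster_memory_gib = 3 * 8          # total memory YARN can hand out
am_memory_gib = 1                   # one ApplicationMaster per submitted job

for concurrent_jobs in (4, 24):
    am_total = concurrent_jobs * am_memory_gib
    left_for_tasks = cluster_memory_gib - am_total
    print(f"{concurrent_jobs} concurrent jobs -> {am_total} GiB for AMs, "
          f"{max(left_for_tasks, 0)} GiB left for task containers")

# With 24 concurrent jobs the ApplicationMasters alone consume all the memory,
# so no task containers can start and every job sits pending; capping the pool
# at around 4 leaves plenty of room for the tasks themselves.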

Related

How can I increase the size of the input data file processed in a MapReduce job?

Currently I am using a cluster with the following configuration:
1 namenode and 5 datanodes. Each datanode has 8.7 TB of hard disk and 32 GB of RAM.
When I try to execute a MapReduce job on more than 300 GB of data I get an error, but when I try to execute a job with the same code on a dataset below 300 GB it gets executed without any problem. It looks like my cluster cannot process more than 300 GB of data; is that the case? Can I run a MapReduce job on a dataset above 300 GB, and what configuration do I need to change? Do I need to make changes in my driver?

Spark Job gets stuck at 99.7%

I'm trying to perform a simple join operation using Talend & Spark. The input dataset is a few million records and the lookup dataset is around 100 records (we might need to join with a lookup dataset of millions of records too).
When just trying to read the input data and generate a flat file with the following memory settings, the job works fine and takes little time to run. But when trying to perform the join operation described above, the job gets stuck at 99.7%.
ExecutorMemory = 20g
Cores Per Executor = 4
Yarn resources allocation = Fixed
Num of executors = 100
spark.yarn.executor.memoryOverhead=6000 (from some preliminary research I found that this should be about 10% of the executor memory, but that didn't help either)
After a while (30-40 minutes) the job prints a log saying "Lost executor xx on abc.xyz.com". This is probably because it is kept waiting for too long and the executor gets killed.
I'm trying to check if anyone has run into this issue where a Spark job gets stuck at 99.7% for a simple operation. Also, what are the recommended tuning properties to use in such a scenario?
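For reference, the settings listed above map onto standard Spark properties. A minimal PySpark sketch of how they might be expressed is below; the application name is hypothetical and the values simply mirror the question (whether 100 executors of 20g plus ~6g overhead even fit in the given YARN cluster is a separate check):

from pyspark import SparkConf, SparkContext

# Sketch of the configuration described above, expressed as Spark properties.
conf = (SparkConf()
        .setAppName("talend-join-job")                      # hypothetical app name
        .set("spark.executor.memory", "20g")                # ExecutorMemory = 20g
        .set("spark.executor.cores", "4")                   # Cores Per Executor = 4
        .set("spark.executor.instances", "100")             # Num of executors = 100
        .set("spark.yarn.executor.memoryOverhead", "6000")) # off-heap overhead in MB

sc = SparkContext(conf=conf)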

How to make a cache from a finished Spark job still accessible to other jobs?

My project implements interactive queries for users to explore the data. The user has a list of columns they can choose from, adds columns to that list and presses "view data". The data is currently stored in Cassandra and we use Spark SQL to query it.
The data flow is: raw logs are processed by Spark and stored into Cassandra. The data is a time series with more than 20 columns and 4 metrics. Currently, because more than 20 dimensions go into the clustering keys, writing to Cassandra is quite slow.
The idea here is to load all the data from Cassandra into Spark and cache it in memory, provide an API to the client and run the queries against the Spark cache.
But I don't know how to keep that cached data persistent. I am trying to use spark-job-server, which has a feature called shared objects, but I am not sure it works.
We can provide a cluster with more than 40 CPU cores and 100 GB of RAM. We estimate the data to query is about 100 GB.
What I have already tried:
Storing in Alluxio and loading into Spark from there, but loading is slow: for 4 GB of data Spark needs to do two things, first read from Alluxio (which takes more than 1 minute) and then spill to disk for the Spark shuffle (which costs 2 or 3 more minutes). That is over our target of under 1 minute. We tested one job on 8 CPU cores.
Storing in MemSQL, but it is kind of costly: one day of data costs 2 GB of RAM, and I am not sure the speed will stay good when we scale.
Using Cassandra directly, but Cassandra does not support GROUP BY.
So, what I really want to know is whether my direction is right or not. What can I change to achieve the goal (queries like MySQL with a lot of GROUP BY, SUM, ORDER BY) and return the results to the client through an API?
If you explicitly call cache or persist on a DataFrame, it will be saved in memory (and/or disk, depending on the storage level you choose) until the context is shut down. This is also valid for sqlContext.cacheTable.
So, as you are using Spark JobServer, you can create a long running context (using REST or at server start-up) and use it for multiple queries on the same dataset, because it will be cached until the context or the JobServer service shuts down. However, using this approach, you should make sure you have a good amount of memory available for this context, otherwise Spark will save a large portion of the data on disk, and this would have some impact on performance.
Additionally, the Named Objects feature of JobServer is useful for sharing specific objects among jobs, but this is not needed if you register your data as a temp table (df.registerTempTable("name")) and cache it (sqlContext.cacheTable("name")), because you will be able to query your table from multiple jobs (using sqlContext.sql or sqlContext.table), as long as these jobs are executed on the same context.
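As a rough illustration of that pattern with a Spark 1.x SQLContext (the source path, table name and column names below are made up; in the question's case the DataFrame would come from the Spark-Cassandra connector rather than Parquet):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="long-running-context")    # in JobServer this context is shared
sqlContext = SQLContext(sc)

# Job 1: load the data once, register it as a temp table and cache it.
df = sqlContext.read.parquet("hdfs:///data/events")  # hypothetical source
df.registerTempTable("events")
sqlContext.cacheTable("events")                       # materialized in memory on first use

# Job 2 (later, on the same context): the cached table is still available,
# so aggregations run against memory instead of re-reading the source.
result = sqlContext.sql(
    "SELECT dim1, SUM(metric1) AS total FROM events GROUP BY dim1 ORDER BY total DESC")
result.show()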

Load and process data in parallel inside Hadoop

I am using Hadoop to process big data. I first load the data into HDFS and then execute jobs, but this is sequential. Is it possible to do it in parallel? For example,
running 3 jobs and 2 data-loading processes for other jobs at the same time on my cluster.
Cheers
It is possible to run all the jobs in parallel in Hadoop if your cluster and jobs satisfy the criteria below:
1) The Hadoop cluster should be able to run a reasonable number of map/reduce tasks in parallel (depending on the jobs), i.e. it should have enough map/reduce slots.
2) If a job that is currently running depends on data that is loaded through another process, we cannot run that data load and the job in parallel.
If your processes satisfy the above conditions, you can run all the jobs in parallel.
Using Oozie you can schedule all the processes to run in parallel; the fork and join nodes in Oozie allow you to run tasks in parallel.
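Outside of Oozie, the same idea can be sketched with a plain Python driver script that launches the HDFS loads and the MapReduce jobs concurrently. The jar names, main classes and paths below are placeholders, and this assumes none of the jobs depends on data that is still being loaded:

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder commands: 2 independent data loads and 3 independent jobs.
commands = [
    ["hdfs", "dfs", "-put", "/local/dataset_a", "/data/raw/a"],
    ["hdfs", "dfs", "-put", "/local/dataset_b", "/data/raw/b"],
    ["hadoop", "jar", "job1.jar", "com.example.Job1", "/data/in1", "/data/out1"],
    ["hadoop", "jar", "job2.jar", "com.example.Job2", "/data/in2", "/data/out2"],
    ["hadoop", "jar", "job3.jar", "com.example.Job3", "/data/in3", "/data/out3"],
]

def run(cmd):
    # Each subprocess submits its own load/job; how much actually runs in
    # parallel is decided by the cluster scheduler and the available slots.
    return cmd, subprocess.run(cmd).returncode

with ThreadPoolExecutor(max_workers=len(commands)) as pool:
    for cmd, rc in pool.map(run, commands):
        print(" ".join(cmd), "->", "OK" if rc == 0 else "failed ({})".format(rc))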
If your cluster has enough resources to run the jobs in parallel, then yes. But make sure the work of each job doesn't interfere with the others: loading data at the same time that another running job should be using it won't work as you expect.
If there are not enough resources, then Hadoop will enqueue the jobs until resources are available, depending on the configured scheduler.

Hadoop Nodes and Roles

I have a Hadoop cluster at work that has over 50 nodes. We occasionally face disk failures and need to decommission the datanode role.
My question is: if I were to decommission only the datanode and leave the TaskTracker running, would this result in failed tasks/jobs on this node due to the unavailability of the HDFS service on that node?
1. Does the TaskTracker on Node1 sit idle since there is no DataNode service on that node? Correct: if the DataNode is disabled, the TaskTracker will not be able to process the data, as the data will not be available, and it will sit idle.
2. Or does the TaskTracker work on data from DataNodes on other nodes? No: due to the data-locality principle, the TaskTracker will not process data from other nodes.
3. Do we get errors from the TaskTracker service on Node1 due to the DataNode on its node being down? The TaskTracker will not be able to process any data, so no errors.
4. If I have services like Hive, Impala, etc. running on HDFS, would those services throw errors upon contacting the TaskTracker on Node1? They will not be able to contact the TaskTracker on Node1. When a client requests processing of the data, the NameNode tells the client about the data locations, so based on those locations all other applications will communicate with the DataNodes.
I would expect any task that tries to read from HDFS on the "dead" node to fail. This should result in the node being blacklisted by M/R after N failures (default is 3 I think). Also, I believe this happens each time a job runs.
However, jobs should still finish since the tasks that got routed to the bad node will simply be retried on other nodes.
Firstly, in order to run a job you need an input file. When you load the input file to HDFS it is split into 64 MB blocks by default, and with the default settings there are 3 replicas. Now, since one of the datanodes in your cluster has failed, the NameNode will not store data on that node; it gets frequent status updates from the datanodes, so it will not choose that specific datanode to store the data.
An exception should only be thrown when you run out of disk space and the dead datanode is the only one left in the cluster. Then it is time for you to replace the datanode and scale up the cluster.
Hope this helps.
