I'm running a Hive-on-Tez job. The job loads data from a table stored as text files into another table stored in ORC format.
I'm using:
INSERT INTO TABLE ORDERREQUEST_ORC
PARTITION(DATE)
SELECT
COLUMN1,
COLUMN2,
COLUMN3,
DATE
FROM ORDERREQUEST_TXT;
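For reference, a dynamic-partition insert like this is typically run with the standard Hive session settings below (these are stock Hive properties, not something specific to my job):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- the DATE partition is then taken from the last column of the SELECT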
While monitoring the job through the Ambari web console, I saw that YARN memory utilization is at 100%.
Can you please advise how to keep YARN memory at a healthy level?
The load averages on the three datanodes:
1. top - 17:37:24 up 50 days, 3:47, 4 users, load average: 15.73, 16.43, 13.52
2. top - 17:38:25 up 50 days, 3:48, 2 users, load average: 16.14, 15.19, 12.50
3. top - 17:39:26 up 50 days, 3:49, 1 user, load average: 11.89, 12.54, 10.49
These are the YARN configurations:
yarn.scheduler.minimum-allocation-mb=5120
yarn.scheduler.maximum-allocation-mb=46080
yarn.nodemanager.resource.memory-mb=46080
FYI, my cluster config:
Nodes = 4 (1 Master, 3 DN )
memory = 64 GB on each node
Processors = 6 on each node
1 TB on each node (5 Disk * 200 GB)
How can I reduce YARN memory utilization?
You are seeing this because the cluster hasn't been configured to cap the YARN memory that a single user can be allocated.
Please set the property below in the YARN configuration to allocate at most 33% of the YARN memory to a single user's jobs; the value can be adjusted based on your requirements.
Change from:
yarn.scheduler.capacity.root.default.user-limit-factor=1
To:
yarn.scheduler.capacity.root.default.user-limit-factor=0.33
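As a rough sanity check of what this does: with yarn.nodemanager.resource.memory-mb=46080 on each of the 3 datanodes, YARN has about 3 × 46080 MB ≈ 135 GB in total. With user-limit-factor=1 a single user in the default queue can take all of it, which is what you are seeing; with 0.33 a single user is capped at roughly a third, i.e. about 45 GB, leaving headroom for other jobs.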
If you need further information on this, please refer to the following link:
https://analyticsanvil.wordpress.com/2015/08/16/managing-yarn-memory-with-multiple-hive-users/
Related
I have migrated a portion of a C application to run on Dataproc using PySpark jobs (reading from and writing to BigQuery; data volume around 10 GB). The C application that runs in 8 minutes in our local data centre takes around 4 hours on Dataproc. Could someone advise me on an optimal Dataproc configuration? At present I am using the one below:
--master-machine-type n2-highmem-32 --master-boot-disk-type pd-ssd --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n2-highmem-32 --worker-boot-disk-type pd-ssd --worker-boot-disk-size 500 --image-version 1.4-debian10
I would really appreciate any help on an optimal Dataproc configuration.
Thanks,
RP
Here are some good articles on job performance tuning on Dataproc: Spark job tuning tips and 10 questions to ask about your Hadoop and Spark cluster performance.
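Most of what those articles suggest comes down to adjusting Spark properties, which on Dataproc can be set at cluster-creation time. A minimal sketch, with the cluster name and values purely illustrative (tune them to your job):
gcloud dataproc clusters create my-cluster \
    --region us-central1 \
    --num-workers 2 --worker-machine-type n2-highmem-32 \
    --properties spark:spark.executor.cores=8,spark:spark.executor.memory=40g,spark:spark.sql.shuffle.partitions=96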
Currently I am using a cluster with the following configuration:
1 namenode and 5 datanodes. Each datanode has 8.7 TB of hard disk and 32 GB of RAM.
When I try to execute a MapReduce job on more than 300 GB of data I get an error, but the same code runs without any problem on datasets below 300 GB. It looks like my cluster cannot process more than 300 GB of data; is that the case? Can I run a MapReduce job on a dataset above 300 GB, and what configurations do I need to change? Do I need to make changes in my driver?
It takes 6 seconds to return the JSON for 9,000 datapoints.
I have approximately 10 GB of data across 12 metrics, say x.open, x.close, and so on.
Data Storage pattern:
Metric : x.open
tagk : symbol
tagv : stringValue
Metric : x.close
tagk : symbol
tagv : stringValue
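For context, each datapoint is written through OpenTSDB's HTTP API with a payload roughly like this (values are illustrative):
POST /api/put
{
    "metric": "x.open",
    "timestamp": 1455753600,
    "value": 145.23,
    "tags": { "symbol": "IBM" }
}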
My cluster setup is as follows:
Node 1: (Real 16GB ActiveNN) JournalNode, Namenode, Zookeeper, RegionServer, HMaster, DFSZKFailoverController, TSD
Node 2: (VM 8GB StandbyNN) JournalNode, Namenode, Zookeeper, RegionServer
Node 3: (Real 16GB) Datanode, RegionServer, TSD
Node 4: (VM 4GB) JournalNode, Datanode, Zookeeper, RegionServer
The setup is for POC/dev, not for production.
The timestamp range is wide: one datapoint per day for each symbol under each metric, from 1980 to today.
In other words, in a continuous run each of my 12 metrics gets 3,095 datapoints added every day, one per symbol.
The cardinality of tag values in the current scenario is 3,095+ symbols.
Query Sample:
http://myIPADDRESS:4242/api/query?start=1980/01/01&end=2016/02/18&m=sum:stock.Open{symbol=IBM}&arrays=true
Debugger result:
8.44 sec; datapoints retrieved: 8,859; data size: 55 KB
Data writing speed is also slow: it takes 6.5 hours to write 2.2 million datapoints.
Am I wrong somewhere with my configuration, or am I expecting too much?
Method of writing: JSON objects via HTTP
Salting Enabled: Not yet
Too much data in one metric will drag performance down. The result may be only 9,000 datapoints, but the raw data set behind it may be very large: retrieving 9,000 datapoints out of one million is very different from retrieving 9,000 datapoints out of one billion.
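Rough arithmetic for this case: with 3,095 symbols and one datapoint per symbol per trading day since 1980 (roughly 9,000 days), each metric holds on the order of 3,095 × 9,000 ≈ 28 million raw datapoints. A query for a single symbol returns only ~8,859 of them, but the scan still runs over a metric whose raw data is orders of magnitude larger than the result, which is largely where the 8+ seconds go.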
I have set up Cloudera Hue and have a cluster with a master node (200 GiB disk, 16 GiB RAM) and 3 datanodes (150 GiB disk and 8 GiB RAM each).
I have a database of roughly 70 GiB. The problem shows up when I run Hive queries from the Hive editor (the Hue GUI): if I submit 5 or 6 queries for execution, the jobs start but then hang and never run. How can I run the queries sequentially? Even though I can submit several queries, a new query should only start when the previous one has completed. Is there any way to make the queries run one by one?
You can run all your queries in one go by separating them with ';' in Hue.
For example:
Query1;
Query2;
Query3
In this case query1, query2 and query3 will run sequentially, one after another.
Hue submits all the queries; if they hang, it means that you are probably hitting a misconfiguration in YARN, like gotcha #5: http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
So the entire flow of YARN/MR2 is as follows:
A query is submitted from the Hue Hive query editor.
A job is started and the ResourceManager starts an ApplicationMaster on one of the datanodes.
This ApplicationMaster asks the ResourceManager for resources (e.g. 2 × 1 GiB / 1 core).
The ResourceManager grants these resources to the ApplicationMaster as containers on the NodeManagers, which then run the map and reduce tasks.
So resource allocation is now handled by YARN. In a Cloudera cluster, Dynamic Resource Pools (a kind of queue) are where jobs are submitted, and YARN then allocates resources to those jobs. By default, the maximum number of concurrent jobs is set in such a way that the ResourceManager allocates all the resources to the jobs' ApplicationMasters, leaving no room for the task containers that the ApplicationMasters need later to actually run their tasks.
http://www.cloudera.com/content/cloudera/en/resources/library/recordedwebinar/introduction-to-yarn-and-mapreduce-2-slides.html
So if we submit a large number of queries from the Hue Hive editor, they are all submitted as jobs concurrently, their ApplicationMasters are allocated resources, no space is left for task containers, and thus all the jobs sit in a pending state.
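As a rough illustration with numbers like this cluster's (the exact figures depend on the NodeManager and ApplicationMaster memory settings): if each of the 3 datanodes exposes about 6 GiB of its 8 GiB to YARN, the pool has roughly 18 GiB in total; submit a dozen queries whose ApplicationMasters each take about 1.5 GiB and the AMs alone consume the whole pool, so no map or reduce container can ever start and every job stays pending.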
The solution is as mentioned above by @Romain:
Set the maximum number of concurrent jobs according to the size and capacity of the cluster; in my case a value of 4 worked.
Now only 4 jobs run concurrently from the pool, and they are allocated resources by the ResourceManager.
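In Cloudera Manager this is the pool's maximum-running-apps setting under Dynamic Resource Pools; if you edit the Fair Scheduler allocation file directly instead, the equivalent looks roughly like this (pool name assumed to be the default one):
<allocations>
    <queue name="default">
        <maxRunningApps>4</maxRunningApps>
    </queue>
</allocations>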
I am new to Hadoop and Pig.
I have set up a Hadoop cluster with 3 nodes. I have written a Pig script that reads the data and executes aggregate functions on it.
When I execute it on a 4.8 GB file with 36 million records, Pig produces output in 51 minutes.
When I execute it on a 9.6 GB file with 72 million records, the Pig script crashes and Hadoop gives the following error:
Unable to recreate exception from backed error: AttemptID:attempt_1389348682901_0050_m_000005_3 Info:Container killed by the ApplicationMaster.
Job failed, hadoop does not return any error message
I am using Hadoop 2.2.0 and Pig 0.12.0.
My node configurations are:
Master: 2 CPU, 2 GB RAM
Slave1: 2 CPU, 2 GB RAM
Slave2: 1 CPU, 2 GB RAM
Could you please advise me on this?
After trying things with Pig, I moved to Hive.
What I observed when I was using Pig:
I was uploading the file to HDFS and then loading it in Pig, so Pig was loading that file again; I was processing the file twice.
For my scenario Hive fits better. I upload the file to HDFS and load it into Hive, which takes a few milliseconds, because Hive works directly with the files in HDFS; there is no need to load the data again into Hive tables. That saves a lot of time.
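As a sketch of what that looks like in practice (table name, columns and path are illustrative, not my actual schema):
CREATE EXTERNAL TABLE orders_raw (
    order_id BIGINT,
    amount   DOUBLE,
    order_dt STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/orders_raw';
-- the table is defined over files already sitting in HDFS, so nothing is copied or re-loaded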
Both components are good; for me, Hive fits better.
Thanks all for your time and advice.