how is "cumulative map pmem" calculated? - hadoop

My MR job has the following characteristics :
input size: 2 GB
chunk size: 128 MB
cluster size: 17 nodes
Distribution: MapR
I can see that "Cumulative Map PMem" size as 15 GB when I run my MR job.
And I have got to know that "Cumulative Map PMem" is Physical Memory counter representing total physical memory that has been used while executing MR job.
I would like to know how exactly this "Cumulative Map PMem" is calculated ? How is this shooting up to 15 GB.

Related

dask 100GB dataframe sorting / set_index on new column out of memory issues

I have a dask dataframe of around 100GB and 4 columns that does not fit into memory. My machine is an 8 CORE Xeon with 64GB of Ram with a local Dask Cluster.
I converted the dataframe to 150 partiitions (700MB each). However
My simple set_index() operation fails with error "95% memory reached"
g=dd.read_parquet(geodata,columns=['lng','indexCol'])
g.set_index('lng').to_parquet('output/geodata_vec_lng.parq',write_index=True )
I tried:
1 worker 4 threads. 55 GB assigned RAM
1 worker 2 threads. 55 GB assigned RAM
1 worker 1 thread. 55 GB assigned RAM
If I make the partitions smaller I get exponentially more shuffling. 100GB is not large. What am I doing wrong?

How to increase hive concurrent mappers to more than 4?

Summary
When I run a simple select count(*) from table query in hive only two nodes in my large cluster are being used for mapping. I would like to use the whole cluster.
Details
I am using a somewhat large cluster (tens of nodes each more than 200 GB RAM) running hdfs and Hive 1.2.1 (IBM-12).
I have a table of several billion rows. When I perform a simple
select count(*) from mytable;
hive creates hundreds of map tasks, but only 4 are running simultaneously.
This means that my cluster is mostly idle during the query which seems wasteful. I have tried ssh'ing to the nodes in use and they are not utilizing CPU or memory fully. Our cluster is backed by Infiniband networking and Isilon file storage neither of which seems very loaded at all.
We are using mapreduce as the engine. I have tried removing any limits to resources that I could find, but it does not change the fact that only two nodes are being used (4 concurrent mappers).
The memory settings are as follows:
yarn.nodemanager.resource.memory-mb 188928 MB
yarn.scheduler.minimum-allocation-mb 20992 MB
yarn.scheduler.maximum-allocation-mb 188928 MB
yarn.app.mapreduce.am.resource.mb 20992 MB
mapreduce.map.memory.mb 20992 MB
mapreduce.reduce.memory.mb 20992 MB
and we are running on 41 nodes. By my calculation I should be able to get 41*188928/20992 = 369 map/reduce tasks. Instead I get 4.
Vcore settings:
yarn.nodemanager.resource.cpu-vcores 24
yarn.scheduler.minimum-allocation-vcores 1
yarn.scheduler.maximum-allocation-vcores 24
yarn.app.mapreduce.am.resource.cpu-vcores 1
mapreduce.map.cpu.vcores 1
mapreduce.reduce.cpu.vcores 1
Is there are way to get hive/mapreduce to use more of my cluster?
How would a go about figuring out the bottle neck?
Could it be that Yarn is not assigning tasks fast enough?
I guess that using tez would improve performance, but I am still interested in why resources utilization is so limited (and we do not have it installed ATM).
Running parallel tasks depends on your memory setting in yarn
for example if you have 4 data nodes and your yarn memory properties are defined as below
yarn.nodemanager.resource.memory-mb 1 GB
yarn.scheduler.minimum-allocation-mb 1 GB
yarn.scheduler.maximum-allocation-mb 1 GB
yarn.app.mapreduce.am.resource.mb 1 GB
mapreduce.map.memory.mb 1 GB
mapreduce.reduce.memory.mb 1 GB
according to this setting you have 4 data nodes so total yarn.nodemanager.resource.memory-mb will be 4 GB that you can use to launch container
and since container can take 1 GB memory so it means at any given point of time you can launch 4 container , one will be used by application master so you can have maximum 3 mapper or reducer tasks can ran at any given point of time since application master,mapper and reducer each is using 1 GB memory
so you need to increase yarn.nodemanager.resource.memory-mb to increase the number of map/reduce task
P.S. - Here we are taking about maximum tasks that can be launched,it may be some less than that also

Yarn: How to utilize full cluster resources?

So I am having a cloudera cluster with 7 worker nodes.
30GB RAM
4 vCPUs
Here are some of my configurations which I found important (from Google) in tuning performance of my cluster. I am running with:
yarn.nodemanager.resource.cpu-vcores => 4
yarn.nodemanager.resource.memory-mb => 17GB (Rest reserved for OS and other processes)
mapreduce.map.memory.mb => 2GB
mapreduce.reduce.memory.mb => 2GB
Running nproc => 4 (Number of processing units available)
Now my concern is, when I look at my ResourceManager, I see Available Memory as 119 GB which is fine. But when I run a heavy sqoop job and my cluster is at its peak it uses only ~59 GB of memory, leaving ~60 GB memory unused.
One way which I see, can fix this unused memory issue is increasing map|reduce.memory to 4 GB so that we can use upto 16 GB per node.
Other way is to increase the number of containers, which I am not sure how.
4 cores x 7 nodes = 28 possible containers. 3 being used by other processes, only 5 are currently being available for sqoop job.
What should be the right config to improve cluster performance in this case. Can I increase the number of containers, say 2 containers per core. And is it recommended?
Any help or suggestions on the cluster configuration would be highly appreciated. Thanks.
If your input data is in 26 splits, YARN will create 26 mappers to process those splits in parallel.
If you have 7 nodes with 2 GB mappers for 26 splits, the repartition should be something like:
Node1 : 4 mappers => 8 GB
Node2 : 4 mappers => 8 GB
Node3 : 4 mappers => 8 GB
Node4 : 4 mappers => 8 GB
Node5 : 4 mappers => 8 GB
Node6 : 3 mappers => 6 GB
Node7 : 3 mappers => 6 GB
Total : 26 mappers => 52 GB
So the total memory used in your map reduce job if all mappers are running at the same time will be 26x2=52 GB. Maybe if you add the memory user by the reducer(s) and the ApplicationMaster container, you can reach your 59 GB at some point, as you said ..
If this is the behaviour you are witnessing, and the job is finished after those 26 mappers, then there is nothing wrong. You only need around 60 GB to complete your job by spreading tasks across all your nodes without needing to wait for container slots to free themselves. The other free 60 GB are just waiting around, because you don't need them. Increasing heap size just to use all the memory won't necessarily improve performance.
Edited:
However, if you still have lots of mappers waiting to be scheduled, then maybe its because your installation insconfigured to calculate container allocation using vcores as well. This is not the default in Apache Hadoop but can be configured:
yarn.scheduler.capacity.resource-calculator :
The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. org.apache.hadoop.yarn.util.resource.DefaultResourseCalculator only uses Memory while DominantResourceCalculator uses Dominant-resource to compare multi-dimensional resources such as Memory, CPU etc. A Java ResourceCalculator class name is expected.
Since you defined yarn.nodemanager.resource.cpu-vcores to 4, and since each mapper uses 1 vcore by default, you can only run 4 mappers per node at a time.
In that case you can double your value of yarn.nodemanager.resource.cpu-vcores to 8. Its just an arbitrary value it should double the number of mappers.

Mahout RowSimilarityJob on big document corpus

I am trying to use Mahout to compute row similarities in a matrix containing 2 million rows. This matrix is produced taking the output of mahout ssvd with rank 400 and then using mahout rowid to transform it in the appropriate format for RowSimilarityJob.
When the job reaches the CooccurrencesMapper it starts emitting a vector for each couple of non-zero elements in a column. After a couple of hours the job starts failing.
I tried to use the --maxObservationsPerColumn parameter, but if i set it to a value too low the results are incorrect, if i set it to a value higher than 30000 it fails.
The problem is that for each map input it emits ~10GB of values.
I invoke the job like this:
mahout/bin/mahout rowsimilarity -Dmapred.reduce.child.java.opts=-Xmx4G
-Dmapred.map.child.java.opts=-Xmx4G -Dmapred.reduce.tasks=20 -Dio.sort.mb=1000 -Dio.sort.factor=3 --tempDir test_row_similarity --input "UHalfSigma_matrix_400/matrix" --output "doc_lsa_similarity_test/" --numberOfColumns 400
--maxSimilaritiesPerRow 35 --maxObservationsPerColumn 50000 -tr 0.4 --similarityClassname SIMILARITY_COSINE -ow --excludeSelfSimilarity true
I use a cluster of 3 nodes:
- cpu 12 core, RAM 64 GB, HDD 256GB
- cpu 12 core, RAM 128 GB, HDD 2 TB
- cp 8 core, RAM 24 GB, HDD 4 TB
Thank you for your help

Swapping of the RAM while Shuffling in HADOOP

I work with hadoop 1.1.1. My project is processing more than 6000 documents. My cluster contains 2 nodes: master(CPU:COREi7, RAM:6G) and slave(CPU:COREi3, RAM:12G). The number of MAPPER is 16. When I assign the number of REDUCER more than 1(e.i. 2,...,16) at the phase of shuffling the RAM begins to SWAP and this causes a significant reduction on my system speed.
How can I stop the RAM from swapping?
What is kept in RAM in the process between MAP and REDUCE?
Is there any reference?
Thanks a lot.
So on the master:
6G physical RAM;
2G allocated per process;
8 mappers and 8 reducers can run concurrently;
8x2 + 8x2, 32G memory required if all tasks are maxed out - over 5x your physical amount.
On the slave:
12G physical RAM;
2G allocated per task;
4 mappers, 4 reducers;
4x2 + 4x2, 16G memory required - 50% more than physical.
Now if you're only running a single job at a time, you can set the slowstart configuration property to 1.0 to ensure that the mappers and reducers don't run concurrently and that will help, but you're still maxed out on the master.
I suggest you reduce either the memory allocation to 1G (if you really want that many map / reduce slots on each node), or reduce the maximum number of tasks for both nodes, such that you're closer to the physical amount (if running maxed out).

Resources