Mahout RowSimilarityJob on big document corpus - hadoop

I am trying to use Mahout to compute row similarities in a matrix containing 2 million rows. The matrix is produced by taking the output of mahout ssvd with rank 400 and then using mahout rowid to transform it into the format RowSimilarityJob expects.
When the job reaches the CooccurrencesMapper it starts emitting a vector for each pair of non-zero elements in a column. After a couple of hours the job starts failing.
I tried to use the --maxObservationsPerColumn parameter, but if I set it too low the results are incorrect, and if I set it to anything above 30,000 the job fails.
The problem is that for each map input it emits ~10 GB of values.
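To put that in perspective, here is a rough estimate of the number of co-occurrence pairs emitted per column (a sketch only, assuming the SSVD output is effectively dense, so each of the 400 columns has close to 2 million non-zero entries before down-sampling):

# Rough estimate of how many pairs the CooccurrencesMapper emits per column.
# Assumption: the column is effectively dense after SSVD and is capped by
# --maxObservationsPerColumn before pairing.
def pairs_per_column(non_zeros):
    # one emission for every pair of non-zero elements in the column
    return non_zeros * (non_zeros - 1) // 2

for cap in (10_000, 30_000, 50_000, 2_000_000):
    print(f"maxObservationsPerColumn={cap:>9,}: ~{pairs_per_column(cap):,} pairs")

Even at a cap of 50,000 observations this is over a billion pairs per column, which helps explain the ~10 GB of map output per input I am seeing.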
I invoke the job like this:
mahout/bin/mahout rowsimilarity \
  -Dmapred.reduce.child.java.opts=-Xmx4G -Dmapred.map.child.java.opts=-Xmx4G \
  -Dmapred.reduce.tasks=20 -Dio.sort.mb=1000 -Dio.sort.factor=3 \
  --tempDir test_row_similarity \
  --input "UHalfSigma_matrix_400/matrix" --output "doc_lsa_similarity_test/" \
  --numberOfColumns 400 --maxSimilaritiesPerRow 35 --maxObservationsPerColumn 50000 \
  -tr 0.4 --similarityClassname SIMILARITY_COSINE -ow --excludeSelfSimilarity true
I use a cluster of 3 nodes:
- CPU 12 cores, RAM 64 GB, HDD 256 GB
- CPU 12 cores, RAM 128 GB, HDD 2 TB
- CPU 8 cores, RAM 24 GB, HDD 4 TB
Thank you for your help

Related

dask 100GB dataframe sorting / set_index on new column out of memory issues

I have a dask dataframe of around 100 GB with 4 columns that does not fit into memory. My machine is an 8-core Xeon with 64 GB of RAM running a local Dask cluster.
I converted the dataframe to 150 partitions (~700 MB each). However, my simple set_index() operation fails with the error "95% memory reached":
g = dd.read_parquet(geodata, columns=['lng', 'indexCol'])
g.set_index('lng').to_parquet('output/geodata_vec_lng.parq', write_index=True)
I tried:
1 worker 4 threads. 55 GB assigned RAM
1 worker 2 threads. 55 GB assigned RAM
1 worker 1 thread. 55 GB assigned RAM
If I make the partitions smaller I get exponentially more shuffling. 100GB is not large. What am I doing wrong?
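For reference, here is a sketch of a variant I am considering, which reads only the needed columns and forces the on-disk shuffle so intermediate shards spill to local disk instead of piling up in worker memory (hedged: the shuffle keyword may be spelled shuffle_method in newer dask releases, and I have not verified that it fixes the problem):

import dask.dataframe as dd

geodata = 'geodata.parq'  # placeholder for the parquet path used above

# Read only the two columns needed for the re-indexing step.
g = dd.read_parquet(geodata, columns=['lng', 'indexCol'])

# Force the partd-based on-disk shuffle so intermediate shards spill to
# local disk instead of accumulating in worker memory, then write straight
# to parquet so the sorted frame is never fully materialised in RAM.
g.set_index('lng', shuffle='disk').to_parquet(
    'output/geodata_vec_lng.parq', write_index=True)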

Improving compute performance of Spark ML ALS

I have a spark job that performs Alternating Least Squares (ALS) on an implicit feedback ratings matrix. I create the ALS object as follows.
val als = new ALS()
.setCheckpointInterval(5)
.setRank(150)
.setAlpha(30.0)
.setMaxIter(25)
.setRegParam(0.001)
.setUserCol("userId")
.setItemCol("itemId")
.setRatingCol("rating")
.setImplicitPrefs(true)
.setIntermediateStorageLevel("MEMORY_ONLY")
.setFinalStorageLevel("MEMORY_ONLY")
The ratings matrix is created and used to fit the ALS model as follows.
val ratingsSchema = StructType(Array(
StructField("userId", IntegerType, nullable = true),
StructField("itemId", IntegerType, nullable = true),
StructField("rating", DoubleType, nullable = true)))
val ratings = spark
.read
.format("parquet")
.schema(ratingsSchema)
.load("/ratings")
.cache()
val model = als.fit(ratings)
There are roughly 150 million unique users and 1 million items in the ratings DataFrame, which has around 850 million rows.
Based on the numbers above, the ratings DataFrame should occupy ~20 GB in memory when fully loaded. The userFactors DataFrame would be roughly 150 million x 150 doubles ≈ 180 GB, and the itemFactors DataFrame should be only about 1.2 GB.
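For transparency, the back-of-envelope sizing comes out of the following rough check (raw element sizes only, ignoring JVM and DataFrame overhead):

GB = 1024 ** 3

# ratings: ~850 million rows of (int userId, int itemId, double rating)
ratings_bytes = 850_000_000 * (4 + 4 + 8)

# factor matrices: one dense vector of rank=150 doubles per user / item
rank = 150
user_factors_bytes = 150_000_000 * rank * 8
item_factors_bytes = 1_000_000 * rank * 8

print(f'ratings      ~ {ratings_bytes / GB:6.1f} GB')       # ~12.7 GB raw, ~20 GB with overhead
print(f'userFactors  ~ {user_factors_bytes / GB:6.1f} GB')  # ~167.6 GB, i.e. the ~180 GB above
print(f'itemFactors  ~ {item_factors_bytes / GB:6.1f} GB')  # ~1.1 GB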
It is taking a really long time for the job to complete (15+ hours). My cluster specs are as follows.
Provider: AWS EMR version 5.14.0
Spark version: 2.3.0
Cluster:
1 MASTER node: m4.xlarge (8 cores, 16GB mem, 32GB storage)
2 CORE nodes: i3.xlarge (4 cores, 30GB mem, 950 GB storage)
20 TASK nodes: r4.4xlarge (16 cores, 122GB mem, 32 GB storage)
Total TASK cores = 320
Total TASK memory: 2440 GB
Based on the numbers above, all DFs should easily fit in memory (there is also TB+ HDFS available, if needed).
Job configuration:
--executor-memory 102g
--num-executors 20
--executor-cores 15
I can see that there are 20 executors running (plus a driver). I have tried both with and without caching the ratings DF.
How do I tune the system to make it run faster, if possible?
Does anyone have insights into the ALS job? Does it do a lot of shuffling? How can we minimize the shuffle?
The ratings matrix is in parquet format with 200 files stored in a bucket in S3.
Would it work better if I have lots of small instances (say 50) or should I get a few (say, 5) very big instances like r4.16xlarge (64 cores, 488GB mem)?
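For reference on the shuffle question: ALS partitions users and items into blocks and re-shuffles rating blocks between them on every iteration, so the block counts are the main shuffle-granularity knobs I am aware of. Below is a hedged PySpark rendering of the same model with those knobs exposed (the block counts and checkpoint directory are placeholders, not a recommendation, and MEMORY_AND_DISK is swapped in so iterations can spill instead of recomputing):

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName('als-tuning').getOrCreate()
# Checkpointing truncates the lineage that otherwise grows each iteration.
spark.sparkContext.setCheckpointDir('hdfs:///tmp/als-checkpoints')  # placeholder path

als = ALS(
    rank=150,
    maxIter=25,
    regParam=0.001,
    alpha=30.0,
    implicitPrefs=True,
    userCol='userId',
    itemCol='itemId',
    ratingCol='rating',
    checkpointInterval=5,
    intermediateStorageLevel='MEMORY_AND_DISK',
    finalStorageLevel='MEMORY_AND_DISK',
    # Block counts control how ratings are co-partitioned and therefore how
    # much data each iteration shuffles; these values are placeholders.
    numUserBlocks=320,
    numItemBlocks=320,
)

ratings = spark.read.parquet('/ratings')
model = als.fit(ratings)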

How to increase hive concurrent mappers to more than 4?

Summary
When I run a simple select count(*) from table query in Hive, only two nodes in my large cluster are used for mapping. I would like to use the whole cluster.
Details
I am using a somewhat large cluster (tens of nodes, each with more than 200 GB of RAM) running HDFS and Hive 1.2.1 (IBM-12).
I have a table of several billion rows. When I perform a simple
select count(*) from mytable;
hive creates hundreds of map tasks, but only 4 are running simultaneously.
This means that my cluster is mostly idle during the query, which seems wasteful. I have tried ssh'ing to the nodes in use and they are not fully utilizing CPU or memory. Our cluster is backed by InfiniBand networking and Isilon file storage, neither of which seems heavily loaded.
We are using MapReduce as the execution engine. I have tried removing any resource limits I could find, but it does not change the fact that only two nodes are being used (4 concurrent mappers).
The memory settings are as follows:
yarn.nodemanager.resource.memory-mb 188928 MB
yarn.scheduler.minimum-allocation-mb 20992 MB
yarn.scheduler.maximum-allocation-mb 188928 MB
yarn.app.mapreduce.am.resource.mb 20992 MB
mapreduce.map.memory.mb 20992 MB
mapreduce.reduce.memory.mb 20992 MB
and we are running on 41 nodes. By my calculation I should be able to get 41*188928/20992 = 369 map/reduce tasks. Instead I get 4.
Vcore settings:
yarn.nodemanager.resource.cpu-vcores 24
yarn.scheduler.minimum-allocation-vcores 1
yarn.scheduler.maximum-allocation-vcores 24
yarn.app.mapreduce.am.resource.cpu-vcores 1
mapreduce.map.cpu.vcores 1
mapreduce.reduce.cpu.vcores 1
Is there a way to get Hive/MapReduce to use more of my cluster?
How would I go about figuring out the bottleneck?
Could it be that YARN is not assigning tasks fast enough?
I guess that using Tez would improve performance, but I am still interested in why resource utilization is so limited (and we do not have it installed at the moment).
The number of parallel tasks depends on your YARN memory settings.
For example, if you have 4 data nodes and your YARN memory properties are defined as below:
yarn.nodemanager.resource.memory-mb 1 GB
yarn.scheduler.minimum-allocation-mb 1 GB
yarn.scheduler.maximum-allocation-mb 1 GB
yarn.app.mapreduce.am.resource.mb 1 GB
mapreduce.map.memory.mb 1 GB
mapreduce.reduce.memory.mb 1 GB
According to these settings you have 4 data nodes, so the total yarn.nodemanager.resource.memory-mb available for launching containers is 4 GB.
Since each container takes 1 GB of memory, at any given point in time you can launch 4 containers. One of them is used by the ApplicationMaster, so you can have at most 3 mapper or reducer tasks running at any given time, since the ApplicationMaster, each mapper and each reducer all use 1 GB of memory.
So you need to increase yarn.nodemanager.resource.memory-mb to increase the number of map/reduce tasks.
P.S. We are talking about the maximum number of tasks that can be launched; the actual number may be somewhat lower.
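To sanity-check this arithmetic for any set of values, here is a small sketch (it models the memory dimension only, not vcores or scheduler queues):

def max_concurrent_tasks(nodes, node_memory_mb, task_memory_mb):
    # Upper bound on concurrent map/reduce containers, memory dimension only.
    containers_per_node = node_memory_mb // task_memory_mb
    # The ApplicationMaster permanently occupies one of these containers.
    return nodes * containers_per_node - 1

# The 4-node / 1 GB toy example from this answer -> 3 tasks
print(max_concurrent_tasks(4, 1024, 1024))

# The asker's settings: 41 nodes, 188928 MB per node, 20992 MB per task -> 368 tasks
print(max_concurrent_tasks(41, 188928, 20992))

With the asker's settings this reproduces the ~369 figure calculated in the question (minus one container for the ApplicationMaster).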

Yarn: How to utilize full cluster resources?

So I have a Cloudera cluster with 7 worker nodes, each with:
30GB RAM
4 vCPUs
Here are some of the configurations I found (through googling) to be important for tuning the performance of my cluster. I am running with:
yarn.nodemanager.resource.cpu-vcores => 4
yarn.nodemanager.resource.memory-mb => 17GB (Rest reserved for OS and other processes)
mapreduce.map.memory.mb => 2GB
mapreduce.reduce.memory.mb => 2GB
Running nproc => 4 (Number of processing units available)
Now my concern is: when I look at my ResourceManager, I see the available memory as 119 GB, which is fine. But when I run a heavy Sqoop job and my cluster is at its peak, it uses only ~59 GB of memory, leaving ~60 GB unused.
One way I can see to fix this unused-memory issue is to increase mapreduce.{map|reduce}.memory.mb to 4 GB, so that we can use up to 16 GB per node.
The other way is to increase the number of containers, though I am not sure how.
4 cores x 7 nodes = 28 possible containers. With 3 being used by other processes, only 5 are currently available for the Sqoop job.
What would be the right configuration to improve cluster performance in this case? Can I increase the number of containers, say to 2 containers per core, and is that recommended?
Any help or suggestions on the cluster configuration would be highly appreciated. Thanks.
If your input data is in 26 splits, YARN will create 26 mappers to process those splits in parallel.
If you have 7 nodes running 2 GB mappers for those 26 splits, the distribution should be something like:
Node1 : 4 mappers => 8 GB
Node2 : 4 mappers => 8 GB
Node3 : 4 mappers => 8 GB
Node4 : 4 mappers => 8 GB
Node5 : 4 mappers => 8 GB
Node6 : 3 mappers => 6 GB
Node7 : 3 mappers => 6 GB
Total : 26 mappers => 52 GB
So the total memory used by your MapReduce job, if all mappers run at the same time, will be 26 x 2 = 52 GB. If you add the memory used by the reducer(s) and the ApplicationMaster container, you can reach your 59 GB at some point, as you said.
If this is the behaviour you are witnessing, and the job is finished after those 26 mappers, then there is nothing wrong. You only need around 60 GB to complete your job by spreading tasks across all your nodes without needing to wait for container slots to free themselves. The other free 60 GB are just waiting around, because you don't need them. Increasing heap size just to use all the memory won't necessarily improve performance.
Edited:
However, if you still have lots of mappers waiting to be scheduled, then maybe it's because your installation is configured to calculate container allocation using vcores as well. This is not the default in Apache Hadoop but can be configured:
yarn.scheduler.capacity.resource-calculator:
The ResourceCalculator implementation to be used to compare resources in the scheduler. The default, org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, only uses memory, while DominantResourceCalculator uses the dominant resource to compare multi-dimensional resources such as memory and CPU. A Java ResourceCalculator class name is expected.
Since you defined yarn.nodemanager.resource.cpu-vcores to 4, and since each mapper uses 1 vcore by default, you can only run 4 mappers per node at a time.
In that case you can double your value of yarn.nodemanager.resource.cpu-vcores to 8. It's an arbitrary value, but it should double the number of mappers.
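To make the two limits concrete, here is a sketch of the per-node arithmetic only (actual scheduling also depends on queues and headroom):

def containers_per_node(node_mem_mb, node_vcores, task_mem_mb, task_vcores,
                        dominant_resource_calculator):
    by_memory = node_mem_mb // task_mem_mb
    by_cpu = node_vcores // task_vcores
    # DefaultResourceCalculator looks at memory only; DominantResourceCalculator
    # takes the tighter of the two dimensions.
    return min(by_memory, by_cpu) if dominant_resource_calculator else by_memory

# The asker's nodes: 17 GB / 4 vcores per node, 2 GB / 1 vcore per mapper
print(containers_per_node(17 * 1024, 4, 2 * 1024, 1, False))  # 8 mappers per node
print(containers_per_node(17 * 1024, 4, 2 * 1024, 1, True))   # 4 mappers per node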

Hadoop MapReduce2 Optimization in Heterogeneous Cluster

I have this configuration:
Hadoop: v2.7.1 (Yarn)
An input file: Size = 100 GB.
3 Slaves: each has 4 VCORES with Speed = 2 GHz and RAM = 8 GB
5 Slaves: each has 2 VCORES with Speed = 1 GHz and RAM = 2 GB
MapReduce program: WordCount
How can I minimize WordCount execution time by assigning small input splits to the 5 slower slaves and big input splits to the 3 fastest slaves?
For each machine you can determine the number of map/reduce slots, so if you want to send less work to the slower machines you can define, for example, 2 map/reduce task slots for each slower machine and 4 map/reduce task slots for each of the faster machines. This way you can control how much workload each node in the cluster receives.
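In YARN terms those "slots" fall out of yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores divided by the per-task request. Here is a sketch of how the asker's cluster could be split, assuming a hypothetical 1 GB / 1 vcore request per map task and illustrative NodeManager allocations (4 GB of the 8 GB on the fast nodes, 2 GB on the slow ones):

# Hypothetical per-task request: 1 GB and 1 vcore per map container.
TASK_MEM_MB, TASK_VCORES = 1024, 1

# (label, node count, NodeManager memory in MB, NodeManager vcores);
# the memory figures are illustrative allocations, not measurements.
node_types = [
    ('fast (4 vcores, 8 GB RAM)', 3, 4 * 1024, 4),
    ('slow (2 vcores, 2 GB RAM)', 5, 2 * 1024, 2),
]

total = 0
for label, count, mem_mb, vcores in node_types:
    per_node = min(mem_mb // TASK_MEM_MB, vcores // TASK_VCORES)
    total += count * per_node
    print(f'{label}: {per_node} concurrent map tasks per node')
print(f'cluster-wide: up to {total} concurrent map tasks')

With these figures the fast nodes run 4 concurrent map tasks each and the slow nodes 2 each, matching the slot split suggested above.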
