mahout ssvd job performance - hadoop

I need to compute an SSVD.
For a 50,000 x 50,000 matrix reduced to 300x300, libraries such as ssvdlibc and others can compute it in less than 3 minutes.
I wanted to do it for big data, so I tried using Mahout. First I ran it locally on my small data set (that same 50,000 x 50,000 matrix), but it takes 32 minutes to complete that simple job, uses around 5.5 GB of disk space for spill files, and causes my Intel i5 with 8 GiB of RAM and an SSD to freeze a few times.
I understand that Mahout and Hadoop must do lots of additional work to run everything as a MapReduce job, but the performance hit just seems too big. I think something must be wrong in my setup.
I've read some Hadoop and Mahout documentation and added a few parameters to my config files, but it's still incredibly slow. Most of the time it uses only one CPU core.
Can someone please tell me what is wrong with my setup? Can it be tuned for this simple, single-machine use case, just to see what to look for in a bigger deployment?
My config files:
mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>local</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx5000M</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>3</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>3</value>
</property>
<property>
<name>io.sort.factor</name>
<value>35</value>
</property>
</configuration>
core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>file:///</value>
</property>
<!--
<property>
<name>fs.inmemory.size.mb</name>
<value>200</value>
</property>
<property>
<name>io.sort.factor</name>
<value>100</value>
</property>
-->
<property>
<name>io.sort.mb</name>
<value>200</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
I run my job like this:
mahout ssvd --rank 400 --computeU true --computeV true --reduceTasks 3 --input ${INPUT} --output ${OUTPUT} -ow --tempDir /tmp/ssvdtmp/
I also configured Hadoop and Mahout with -Xmx4000m.
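(By that I mean heap settings along these lines; a sketch only, assuming the standard hadoop-env.sh and the MAHOUT_HEAPSIZE variable read by the bin/mahout launcher:)
# hadoop-env.sh (heap size in MB)
export HADOOP_HEAPSIZE=4000
# environment for bin/mahout (also in MB)
export MAHOUT_HEAPSIZE=4000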

Well, first of all I would verify that the job is actually running in parallel, make sure HDFS replication is set to "1", and just generally check your parameters. Seeing only one core in use is definitely an issue!
But!
The slowness is probably not going to go away completely. You might be able to speed things up significantly with proper configuration, but at the end of the day the Hadoop model is not going to outcompete optimized shared-memory libraries on a single computer.
The power of Hadoop/Mahout is big data, and honestly 50k x 50k is still fairly small and easily manageable on a single computer. Essentially, Hadoop trades speed for scalability. So while it might not outcompete those other libraries on 50,000 x 50,000, try getting them to work on 300,000 x 300,000, whereas with Hadoop you would be sitting pretty on a distributed cluster.
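On the first point: with mapred.job.tracker set to "local" and fs.defaultFS pointing at file:///, the whole job runs inside a single JVM via the local job runner, which would explain the single busy core. A minimal sketch of a pseudo-distributed setup to compare against (assuming Hadoop 1.x-style property names and the usual localhost ports):
<!-- core-site.xml: use a local HDFS instead of the raw local filesystem -->
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<!-- hdfs-site.xml: single machine, so keep replication at 1 -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!-- mapred-site.xml: a real jobtracker instead of "local", so tasks get their own JVMs -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>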

Related

Odd Hadoop behavior, master performing all work?

I have set up a cluster using this guide: https://medium.com/#jootorres_11979/how-to-set-up-a-hadoop-3-2-1-multi-node-cluster-on-ubuntu-18-04-2-nodes-567ca44a3b12
Currently I have one datanode and one master node.
When I run a Hadoop job, the datanode's network activity indicates that it is sending a lot of data, and the namenode receives that data. Also, the namenode's CPU is fully utilized while the datanode's CPU is not used at all. See the figure:
The nodes are VMs on the same machine. This happens for several different scripts; the figure is from running a WordCount job.
Why is the work not being performed on the datanode? What could cause such a behavior?
Any help is appreciated.
According to the guide, mapred-site.xml was not changed, which means the default values are used. The default for mapreduce.framework.name is "local", so all calculations are performed locally. This must be changed to "yarn".
I created the following mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value> $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*
$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
</value>
</property>
</configuration>
I also had to change yarn-site.xml to:
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
After restarting YARN and Hadoop, everything worked as expected. The work is now performed on the datanodes.
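(For reference, restarting with the standard scripts is enough; a sketch assuming the usual $HADOOP_HOME/sbin layout:)
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh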

Convert Avro to Parquet in NiFi

I would like to convert Avro files to Parquet in NiFi. I know it's possible to convert to ORC via the ConvertAvroToORC processor, but I haven't found a solution for converting to Parquet.
I'm converting JSON to Avro via a ConvertRecord processor (JsonTreeReader and AvroRecordSetWriter). After that I would like to convert the Avro payload to Parquet before putting it into an S3 bucket. I don't want to store it in HDFS, so the PutParquet processor does not seem applicable.
I would need a processor such as: ConvertAvroToParquet
@Martin, you can use a very handy processor, ConvertAvroToParquet, which I recently contributed to NiFi. It should be available in the latest version.
The purpose of this processor is exactly what you are looking for. For more details on this processor and why it was created, see NIFI-5706.
Code Link.
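In your flow that would look roughly like this (a sketch; PutS3Object is just one way to land the result in S3, and this particular processor chain is an assumption on my part):
ConvertRecord (JsonTreeReader / AvroRecordSetWriter)
-> ConvertAvroToParquet
-> PutS3Object (bucket and object key taken from the flow file's filename attribute)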
Actually it is possible to use the PutParquet processor.
The following description is from a working flow in NiFi 1.8.
Place the following libs into a folder, e.g. /home/nifi/s3libs/:
aws-java-sdk-1.11.455.jar (+ Third-party libs)
hadoop-aws-3.0.0.jar
Create an XML file, e.g. /home/nifi/s3conf/core-site.xml. It might need some additional tweaking; use the right endpoint for your region.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>s3a://BUCKET_NAME</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>ACCESS-KEY</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>SECRET-KEY</value>
</property>
<property>
<name>fs.AbstractFileSystem.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3A</value>
</property>
<property>
<name>fs.s3a.multipart.size</name>
<value>104857600</value>
<description>The parser could not handle 100M, so the value is given in bytes. Maybe not needed after further testing.</description>
</property>
<property>
<name>fs.s3a.endpoint</name>
<value>s3.eu-central-1.amazonaws.com</value>
<description>Frankfurt</description>
</property>
<property>
<name>fs.s3a.fast.upload.active.blocks</name>
<value>4</value>
<description>
Maximum number of blocks a single output stream can have
active (uploading, or queued to the central FileSystem
instance's pool of queued operations).
This stops a single stream overloading the shared thread pool.
</description>
</property>
<property>
<name>fs.s3a.threads.max</name>
<value>10</value>
<description>The total number of threads available in the filesystem for data
uploads *or any other queued filesystem operation*.</description>
</property>
<property>
<name>fs.s3a.max.total.tasks</name>
<value>5</value>
<description>The number of operations which can be queued for execution</description>
</property>
<property>
<name>fs.s3a.threads.keepalivetime</name>
<value>60</value>
<description>Number of seconds a thread can be idle before being terminated.</description>
</property>
<property>
<name>fs.s3a.connection.maximum</name>
<value>15</value>
</property>
</configuration>
Usage
Create a PutParquet processor. Under Properties set
Hadoop Configuration Resources: /home/nifi/s3conf/core-site.xml,
Additional Classpath Resources: /home/nifi/s3libs,
Directory: s3a://BUCKET_NAME/folder/ (EL available)
Compression Type: tested with NONE, SNAPPY
Remove CRC: true
The flow file must contain a filename attribute - no fancy characters or slashes.
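If your flow files don't have one yet, an UpdateAttribute processor in front of PutParquet can set it; a sketch (the naming scheme is just an example, using NiFi expression language):
UpdateAttribute (before PutParquet)
filename: ${UUID()}.parquet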

Hadoop 2.5.1 job stuck at map 0% and reduce 0%

I am trying to run a word count example. My current testing setup is:
NameNode and ResourceManager on one machine (10.38.41.134).
DataNode and NodeManager on another (10.38.41.135).
They can ssh between them without passwords.
When reading the logs, I don't get any warnings, except a security warning (I didn't set security up for testing) and a containermanager.AuxServices 'mapreduce_shuffle' warning. Upon submitting the example job, the nodes react to it and output logs, which suggests that they can communicate well. The NodeManager reports memory usage, but the job doesn't budge.
Where should I even start looking for problems? Everything else I could find is either old or irrelevant. I followed the official cluster setup tutorial for version 2.5.1, which left way too many questions unanswered.
My conf files are the following:
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://10.38.41.134:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.rpc-bind-host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>dfs.namenode.servicerpc-bind-host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
<value>NEVER</value>
<description>
</description>
</property>
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>The runtime framework for executing MapReduce jobs.
Can be one of local, classic or yarn.
</description>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>300</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>10.38.41.134:50030</value>
</property>
</configuration>
Everything else is default.
I suggest you first try to get it working with a single-server cluster so it's easier to debug.
When that is working, continue with two nodes.
As already suggested, memory might be an issue. Without tweaking the settings, it seems around 2 GB is the minimum, and I'd recommend at least 4 GB per server. Also remember to check the job's logs (under logs/userlogs, especially syslog).
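If you want to sanity-check the memory side once the cluster is up, something along these lines is a starting point (a sketch only; the numbers depend on your machines, and the property names are the standard Hadoop 2.x ones):
<!-- yarn-site.xml: memory a NodeManager is allowed to hand out to containers -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<!-- mapred-site.xml: per-task container sizes -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2048</value>
</property>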
P.S. I share your frustration about old / irrelevant documentation.

Hadoop - MapReduce runs extremely slow when using YARN

I know there is another question about this, but there are no answers yet, so I'm going to try asking it in more detail.
I am running a MapReduce job using Hadoop 2.2.0 on a 2-node cluster that I have set up on two Amazon EC2 instances; the master node and the slave node are both medium instances. It runs extremely slowly, taking over 17 minutes, but when I run the exact same job on the same cluster without YARN it runs in under 1 minute. Here is what my mapred-site.xml looks like:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
If I change the mapreduce.framework.name property to 'local', so that the file simply reads:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>local</value>
</property>
</configuration>
I can then run the same MapReduce job in less than a minute. However, I would like to use YARN so that I can track the job through the web app. When I run the job with mapreduce.framework.name set to yarn, it takes 17+ minutes to run the exact same job. I cannot imagine that YARN would slow a MapReduce job down to such an extreme degree.
I am also using "top" to track my CPU usage, and it seems that when I run it with YARN, the CPU usage is split between the different nodes; however, when I run it with "local", all of the processing is done on the master node. I'm not sure how this makes sense, because it seems to me that when the CPU load is split between the different nodes, the job should run faster, not slower. Is there something I forgot to configure in Hadoop to make running on a cluster faster?
Here are the rest of my configuration files:
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://namenode:8020</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>file:/home/ubuntu/hadoop/hdfs/snn</value>
</property>
<property>
<name>fs.checkpoint.edits.dir</name>
<value>file:/home/ubuntu/hadoop/hdfs/snn</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/ubuntu/hadoop/hdfs/nn</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/ubuntu/hadoop/hdfs/nn</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>namenode:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>namenode:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>namenode:8030</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>namenode:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>namenode:8088</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Is there something wrong with the way I set this up? Has anyone else run into this problem? Any help will be greatly appreciated. Thanks!
I wish I still remembered where I read this so I could give you a reference. You won't benefit from YARN unless you have a large cluster.

When is hdfs-site.xml loaded by hadoop?

I have Hive and Hadoop installed on my system.
This is my hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
If I run bin/start-all.sh, go to Hive, and run a select query, I get the error:
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
If I wait for some time and run the Hive query again, it works.
I read that the safe mode threshold is set using the property dfs.namenode.safemode.threshold-pct.
I added that property to my hdfs-site.xml:
<property>
<name>dfs.namenode.safemode.threshold-pct</name>
<value>0.500f</value>
</property>
Again I started all the Hadoop nodes and ran the Hive query, but I was still getting the same error:
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will
This means that either my XML is wrong, or I have to do something else to actually get hdfs-site.xml loaded.
Can someone tell me what I am doing wrong?
I was making a mistake. I went and checked hdfs-default.xml in the src folder and found this:
<property>
<name>dfs.safemode.threshold.pct</name>
<value>0.999f</value>
<description>
Specifies the percentage of blocks that should satisfy
the minimal replication requirement defined by dfs.replication.min.
Values less than or equal to 0 mean not to start in safe mode.
Values greater than 1 will make safe mode permanent.
</description>
</property>
I think I am using an old version of Hadoop, since it still uses dfs.safemode.threshold.pct (deprecated in newer releases in favor of dfs.namenode.safemode.threshold-pct).
I modified my hdfs-site.xml, then stopped and started the namenode:
<property>
<name>dfs.safemode.threshold.pct</name>
<value>2</value>
</property>
and it worked!
The ratio of reported blocks 0.0000 has not reached the threshold 2.0000. Safe mode will be turned off automatically.
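Side note: if you only need to get past safe mode while testing, you can also drop out of it manually; this isn't what I did above, just the standard admin command for old Hadoop releases (newer ones use hdfs dfsadmin instead of hadoop dfsadmin):
bin/hadoop dfsadmin -safemode leave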
