Mahout parallel k-means in Hadoop

Is it possible to run the Mahout k-means algorithm in parallel (multi-core) using Hadoop? How?
Mahout runs on Hadoop, but it only uses one CPU:
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job --input testdata --output end1200_50 --numClusters 1200 --t1 1000 --t2 500 --maxIter 50
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.10.1-job.jar
[...]
My files are in HDFS:
hadoop fs -ls /user/root/testdata
Found 12 items
-rw-r--r-- 1 root supergroup 373560731 2015-06-26 07:51 /user/root/testdata/16773m.mat.txt
-rw-r--r-- 1 root supergroup 373819865 2015-06-26 07:51 /user/root/testdata/16786m.mat.txt
[...]
My mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>14</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx7000M</value>
</property>
</configuration>
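One way to check whether the k-means iterations are actually being submitted to YARN (where multiple map tasks can run in parallel) rather than falling back to local, single-JVM execution; a sketch assuming the standard Hadoop 2.x client tools and the conf path shown above:
# Check which framework MapReduce is configured to use; with Hadoop 2.x it
# needs to be "yarn" for the job to be distributed instead of running in a
# single local JVM (the default is "local" when the property is absent).
grep -A1 mapreduce.framework.name /usr/local/hadoop/etc/hadoop/mapred-site.xml
# While the Mahout driver is running, list the applications YARN has accepted;
# if the k-means iterations never appear here, MapReduce is running locally.
yarn application -list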

Related

HDFS file is not distributed

I am new to Hadoop and I am configuring a Hadoop cluster. The Hadoop version is 3.1.3. I want to run the NameNode, a DataNode, and a NodeManager on host hadoop102; a DataNode, the ResourceManager, and a NodeManager on host hadoop103; and the SecondaryNameNode, a DataNode, and a NodeManager on hadoop104.
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop102:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop-3.1.3/data</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop102:9870</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop104:9868</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop103</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
workers
hadoop102
hadoop103
hadoop104
I upload the test file from host hadoop102 with the command
hadoop fs -put $HADOOP_HOME/wcinput/word.txt /input
Why is the file only available on hadoop102? I thought it would be copied to hadoop103 and hadoop104 in their local file systems.
File Information
HDFS is not a replicated file system in the sense you might expect: putting a file into HDFS does not mean it appears as a regular file on each data node's local filesystem (under /, for example).
HDFS splits the file into blocks, and those blocks are replicated across the cluster according to the configured replication factor.
When you run -copyFromLocal or hdfs dfs -put, the file is split into blocks and those blocks are distributed to the data nodes in a replicated fashion.
So if one node goes down, you can still retrieve your file.
But where is the file? It will not be in the machines' local filesystems as a plain file; its blocks are stored by the DataNode processes.
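To see where HDFS actually placed the blocks of the uploaded file, fsck can list each block and its replica locations; a sketch assuming the upload landed at /input/word.txt:
# Show the blocks of the file and the DataNodes holding each replica.
hdfs fsck /input/word.txt -files -blocks -locations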
How can you configure the number of replicas?
You can set dfs.replication to 3 in hdfs-site.xml,
and you can set the number of replicas for a single file:
hadoop fs -setrep -w 3 /my/file
You can also change the replication factor of all the files under a directory:
hadoop fs -setrep -R -w 3 /my/dir
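To confirm the replication factor HDFS recorded for a file, either of the following works (reusing the /my/file path from above):
# %r prints the replication factor stored for the file.
hdfs dfs -stat %r /my/file
# The second column of a plain listing shows the same number.
hadoop fs -ls /my/file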

Hadoop 3.2.1 Multinode Cluster NodeManager is not running

I have Hadoop 3.2.1 installed on Ubuntu 16.04 LTS, and my cluster has 18 datanodes and 1 master.
After running:
$ start-dfs.sh
$ start-yarn.sh
$ jps
On the master I get the following:
ResourceManager
NameNode
SecondaryNameNode
Jps
And on the datanodes:
DataNode
Jps
All the nodes seem to be live:
NameNode Overview Web Page
But when I reach the Cluster overview, none of my datanodes seems to be active:
Cluster Overview
My configurations files:
core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadoop-3.2.1/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/hadoop-3.2.1/data/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/hadoop-3.2.1/data/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
The namenode and datanode directories exist on every host (master and datanodes).
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services </name>
<value> mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
</configuration>
Also, I have configured hadoop-env.sh with the JAVA_HOME path, and all the other variables are in the .bashrc file (also on every host).
I have modified the /etc/hosts file to include all the hosts with their IPs and hostnames, and finally I have also modified the workers file to include all the IPs of the datanodes.
The first time I formatted the NameNode, the directories in hdfs-site.xml were wrong (I had the datanode dir twice), so HDFS made its own directories under /tmp/hdfs/ (if I remember correctly). But I fixed this by formatting the NameNode again with the correct directories.
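Two checks that may help narrow this down, assuming the standard layout with daemon logs under $HADOOP_HOME/logs:
# List every NodeManager the ResourceManager knows about, in any state;
# a datanode host missing from this list never registered with YARN.
yarn node -list -all
# On a host whose NodeManager is not running, its own log usually explains
# why it failed to start.
tail -n 100 $HADOOP_HOME/logs/*nodemanager*.log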

HBase to Use HDFS HA

I am trying to set up HBase HA with Hadoop HA.
I have set up Hadoop HA and tested it.
But in the HBase setup, while starting, I am getting the following error:
2020-05-02 16:11:09,336 INFO [main] ipc.RpcServer: regionserver/cluster-hadoop-01/172.18.20.3:16020: started 10 reader(s) listening on port=16020
2020-05-02 16:11:09,473 INFO [main] metrics.MetricRegistries: Loaded MetricRegistries class org.apache.hadoop.hbase.metrics.impl.MetricRegistriesImpl
2020-05-02 16:11:09,840 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: Failed construction of Regionserver: class org.apache.hadoop.hbase.regionserver.HRegionServer
at org.apache.hadoop.hbase.regionserver.HRegionServer.constructRegionServer(HRegionServer.java:2896)
at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:64)
at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:127)
at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2911)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.regionserver.HRegionServer.constructRegionServer(HRegionServer.java:2894)
... 5 more
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfscluster
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:417)
at org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:132)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:351)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:285)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:160)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.hadoop.hbase.util.CommonFSUtils.getRootDir(CommonFSUtils.java:309)
at org.apache.hadoop.hbase.util.CommonFSUtils.isValidWALRootDir(CommonFSUtils.java:358)
at org.apache.hadoop.hbase.util.CommonFSUtils.getWALRootDir(CommonFSUtils.java:334)
at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeFileSystem(HRegionServer.java:683)
at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:626)
... 10 more
Caused by: java.net.UnknownHostException: hdfscluster
... 26 more
I think my HBase setup doesn't recognize my nameservice hdfscluster.
I tried Hadoop 2.X and Hadoop 3.X.
Hadoop 2.X: Hadoop 2.10.0 & HBase 1.6.0 & JDK 1.8.0_251 & ZooKeeper 3.6.0.
Hadoop 3.X: Hadoop 3.2.1 & HBase 2.2.4 & JDK 1.8.0_251 & ZooKeeper 3.6.0.
OS Version: Ubuntu 16.04.6
My core-site.xml has
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hdfscluster</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/data/hadoop/tmp</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>cluster-hadoop-01:2181,cluster-hadoop-02:2181,cluster-hadoop-03:2181</value>
</property>
</configuration>
My hdfs-site.xml has
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/data/hadoop/data/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/data/hadoop/data/hdfs/data</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>hdfscluster</value>
</property>
<property>
<name>dfs.ha.namenodes.hdfscluster</name>
<value>nn-01,nn-02</value>
</property>
<property>
<name>dfs.namenode.rpc-address.hdfscluster.nn-01</name>
<value>cluster-hadoop-01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.hdfscluster.nn-02</name>
<value>cluster-hadoop-02:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.hdfscluster.nn-01</name>
<value>cluster-hadoop-01:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.hdfscluster.nn-02</name>
<value>cluster-hadoop-02:9870</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://cluster-hadoop-01:8485;cluster-hadoop-02:8485;cluster-hadoop-03:8485/hdfscluster</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/data/hadoop/tmp/journalnode</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence(hadoop:22)</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hadoop/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
</configuration>
My hbase-site.xml has
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://hdfscluster/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>cluster-hadoop-01,cluster-hadoop-02,cluster-hadoop-03</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/data/zookeeper/data</value>
</property>
<property>
<name>hbase.tmp.dir</name>
<value>/data/hbase/tmp</value>
</property>
</configuration>
My hbase-env.sh has
export JAVA_HOME="/opt/jdk"
export HBASE_MANAGES_ZK=false
export HADOOP_HOME="/opt/hadoop"
export HBASE_CLASSPATH=".:${HADOOP_HOME}/etc/hadoop"
export HBASE_LOG_DIR="/data/hbase/log"
My HBase conf path:
root@cluster-hadoop-01:~# ll /opt/hbase/conf/
total 56
drwxr-xr-x 2 root root 4096 May 2 16:31 ./
drwxr-xr-x 7 root root 4096 May 2 01:18 ../
-rw-r--r-- 1 root root 18 May 2 10:36 backup-masters
lrwxrwxrwx 1 root root 36 May 2 12:04 core-site.xml -> /opt/hadoop/etc/hadoop/core-site.xml
-rw-r--r-- 1 root root 1811 Jan 6 01:24 hadoop-metrics2-hbase.properties
-rw-r--r-- 1 root root 4616 Jan 6 01:24 hbase-env.cmd
-rw-r--r-- 1 root root 7898 May 2 10:36 hbase-env.sh
-rw-r--r-- 1 root root 2257 Jan 6 01:24 hbase-policy.xml
-rw-r--r-- 1 root root 841 May 2 16:10 hbase-site.xml
lrwxrwxrwx 1 root root 36 May 2 12:04 hdfs-site.xml -> /opt/hadoop/etc/hadoop/hdfs-site.xml
-rw-r--r-- 1 root root 1169 Jan 6 01:24 log4j-hbtop.properties
-rw-r--r-- 1 root root 4949 Jan 6 01:24 log4j.properties
-rw-r--r-- 1 root root 54 May 2 10:33 regionservers
Through repeated attempts I found a solution, but I still do not know the reason.
Modify the hdfs-site.xml configuration file:
<property>
<name>dfs.client.failover.proxy.provider.hdfscluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
The official documentation does not appear to require the nameservice ID here.
Link: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
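Before restarting HBase, it may also be worth confirming from the RegionServer host that the HDFS client configuration on the classpath resolves the logical nameservice; a sketch using the nameservice and NameNode IDs from the question:
# Each of these should succeed without an UnknownHostException for the nameservice.
hdfs getconf -confKey dfs.nameservices
hdfs haadmin -getServiceState nn-01
hadoop fs -ls hdfs://hdfscluster/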
While going through the same issue, I learned that we have to use the same machines for both HBase and HDFS.
e.g.
Node-1 -> should have the active NameNode & the HBase Master
Node-2 -> should have the standby NameNode, a DataNode & the HBase backup Master, a RegionServer
Node-3 -> should have a DataNode & a RegionServer
NOTE: The NameNode & HBase Master machines should be the same, and the DataNode & RegionServer machines should be the same.
OR, as another solution if you need to keep them on separate nodes:
just keep a copy of hdfs-site.xml in your $HBASE_HOME/conf directory on each node of your HBase cluster (a sketch follows below).
Make sure the hostnames of the HDFS cluster are in the /etc/hosts files as well.
Any further suggestions are most welcome!
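A minimal sketch of the copy-the-config approach, using the hostnames and install paths shown in the question (adjust to your own layout):
# Copy the HA-aware client configs into the HBase conf directory on each node.
for host in cluster-hadoop-01 cluster-hadoop-02 cluster-hadoop-03; do
  scp /opt/hadoop/etc/hadoop/hdfs-site.xml /opt/hadoop/etc/hadoop/core-site.xml \
      "$host":/opt/hbase/conf/
done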

How to increase the capacity of HDFS in Hadoop 2.x

I've been trying to find out how to increase the capacity of HDFS in Hadoop 2.7.2 with Spark 2.0.0.
I read this link,
but I don't understand it. Here is my core-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>hadoop_eco/hadoop/tmp</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://com1:9000</value>
</property>
</configuration>
and hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>hadoop_eco/hadoop/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>hadoop_eco/hadoop/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
When I run Spark with 1 namenode and 10 datanodes, I get this error message:
org.apache.hadoop.hdfs.StateChange: DIR* completeFile:
/user/spark/_temporary/0/_temporary/attempt_201611141313_0001_m_000052_574/part-00052
is closed by DFSClient_NONMAPREDUCE_1638755846_140
I couldn't identify this error, but it may be related to a lack of disk capacity.
My configured capacity (hdfs) is 499.76GB and each datanode's capacity is 49.98GB.
So, is there a method to increase capacity of hdfs?
I solved it.
It turns out to be easy to change the capacity of HDFS.
I changed dfs.datanode.data.dir in hdfs-site.xml:
<property>
<name>dfs.datanode.data.dir</name>
<value>file://"your directory path"</value>
</property>
and ran these commands:
hadoop namenode -format
stop-all.sh
start-all.sh
Finally, check the capacity of HDFS using hdfs dfsadmin -report.
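For reference, one ordering of the steps above as a shell sequence, stopping the daemons before reformatting; note that hadoop namenode -format discards the existing NameNode metadata, so any data already in HDFS becomes inaccessible:
stop-all.sh               # stop HDFS and YARN daemons
hadoop namenode -format   # re-initialize the NameNode metadata (destructive)
start-all.sh              # start the daemons again
hdfs dfsadmin -report     # verify the new configured capacity per datanode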

Not able to see the Job History (http://localhost:19888) page in a web browser in Hadoop

I am using Hadoop version 2.4.1 on Ubuntu 14.04 32-bit.
When I run a sample job using the hadoop jar user_jar.jar command, I am not able to see the output at http://localhost:19888 (page not found).
What could be the possible reason?
Thank you in advance.
jps output:
3931 Jps
3719 NodeManager
3420 SecondaryNameNode
3593 ResourceManager
3246 DataNode
3126 NameNode
core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Run the mr-jobhistory-daemon.sh script to start the JobHistoryServer:
$ $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONFIG_DIR start historyserver
Now
$ jps
2135 DataNode
2339 SecondaryNameNode
2627 NodeManager
3176 JobHistoryServer
1971 NameNode
3213 Jps
2485 ResourceManager
and
$ netstat -ntlp | grep 19888
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 127.0.0.1:19888 0.0.0.0:* LISTEN 3176/java
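A quick way to confirm the JobHistory server web service is answering, independent of the browser; /ws/v1/history/info is part of the History Server REST API:
# Returns a small JSON document with the history server version and start time.
curl http://localhost:19888/ws/v1/history/info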
