Data Nodes not Appearing in Hadoop Cluster

I am trying to create a Hadoop cluster. HDFS is starting normally and I am also able to access it through the web interface, but the data nodes are not showing up. Here is the result I got by running the jps command.
node1 is the master node, but running `yarn node -list` does not return the list of worker nodes.
HDFS and YARN seem to have started normally. Here is the output from start-dfs.sh and start-yarn.sh.
What should I do?
Here is my yarn-site.xml. Each node has 2 GB of memory.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Resource-->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
I have tried stopping all services with stop-all.sh and then restarting them. Any advice?

The problem was in the yarn-site.xml file: I should have added the ResourceManager address. If node1 is the name of the master node, then:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node1</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>node1:8032</value>
</property>
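To confirm the fix, one minimal check (assuming the updated yarn-site.xml has been copied to every node and the standard Hadoop sbin scripts are on the PATH) is to restart YARN and list the nodes again:
# restart YARN so the NodeManagers re-register with the ResourceManager
stop-yarn.sh
start-yarn.sh
# the worker nodes should now show up as RUNNING
yarn node -list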

Related

How to start datanode in hadoop slave machine?

I'm creating a Hadoop cluster using the YARN configuration. I have 2 VMs in VirtualBox, and when I run start-all.sh (start-dfs.sh and start-yarn.sh) I get a positive answer from jps on both the master and slave terminals, but when I access master-ip:9870 in the browser there is no DataNode started.
core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoopuser/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoopuser/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
workers
hadoop-slave1
/etc/hosts
master-ip hadoop-master
slave-ip hadoop-slave1
The configuration above is on both the master and the slave machine.
I also have JAVA_HOME, HADOOP_HOME and PDSH_RCMD_TYPE set in my .bashrc, and I have created an SSH key on the master and added it to the slave's authorized keys to allow SSH connections.
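For reference, those .bashrc entries and the key exchange typically look something like the sketch below; the JAVA_HOME path is an assumption and will differ per install:
# assumed install locations - adjust to your environment
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoopuser/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# makes pdsh (used by the Hadoop 3 start scripts when installed) connect over ssh
export PDSH_RCMD_TYPE=ssh
# generate a key on the master and copy it to the worker
ssh-keygen -t rsa
ssh-copy-id hadoopuser@hadoop-slave1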
On the master machine I have this output:
On my slave machine:
I have 0 nodes in my HDFS web view:
But I can see the slave node registered with YARN:
I deleted the Hadoop tmp files and the DataNode folders before formatting HDFS on the master and starting all the processes. I'm using Hadoop 3.2.1.
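A quick way to check whether the DataNode actually registered with the NameNode (beyond the jps output) is to query HDFS directly; if it reports 0 live datanodes, the DataNode log on the slave usually says why, and a clusterID mismatch after re-formatting the NameNode is a common cause. A sketch, assuming the default log location and the hadoopuser account from the paths above:
# live datanodes as seen by the namenode
hdfs dfsadmin -report | grep -A2 'Live datanodes'
# on the slave, look for registration errors in the datanode log
tail -n 100 $HADOOP_HOME/logs/hadoop-hadoopuser-datanode-*.log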

Hadoop job keeps running and no container is allocated

I tried running a MapReduce job on Hadoop 2.8.5, but it keeps running indefinitely without making progress.
The Application State is as below:
YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register with RM.
RM web UI:
The health-report says: 1/1 local-dirs are bad: /home/hduser/hadooptmpdata/nm-local-dir; 1/1 log-dirs are bad: /home/hduser/hadoop-2.8.5/logs/userlogs
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/hadooptmpdata</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hduser/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hduser/hdfs/datanode</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>100</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>3</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>3</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/home/hduser/hadooptmpdata/nm-local-dir</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.map.cpu.vcores</name>
<value>2</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.cpu.vcores</name>
<value>2</value>
</property>
<property>
<name>mapreduce.cluster.local.dir</name>
<value>/home/user/hduser/hadooptmpdata/mapred/local</value>
</property>
</configuration>
I am running Hadoop on Ubuntu; my PC has an Intel i7 processor with 16 GB of RAM and a 256 GB SSD.
YARN's ResourceManager needs compute resources from the NodeManager(s) in order to run anything. Your NodeManager reports that its local directory is bad, which means you have no compute resources available (you can verify this from your cluster metrics: note all the zeros), and that is why your application is stuck in "ACCEPTED".
Fix your yarn.nodemanager.local-dirs and make sure YARN has full permissions on it to proceed.
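A minimal sketch of that fix, assuming the NodeManager runs as the hduser account shown in the paths above (adjust the group if yours differs):
# recreate the bad local and log directories and give the YARN user ownership
mkdir -p /home/hduser/hadooptmpdata/nm-local-dir
mkdir -p /home/hduser/hadoop-2.8.5/logs/userlogs
chown -R hduser:hduser /home/hduser/hadooptmpdata /home/hduser/hadoop-2.8.5/logs
chmod -R 755 /home/hduser/hadooptmpdata/nm-local-dir
# restart YARN so the NodeManager re-runs its disk health check
stop-yarn.sh
start-yarn.sh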

How to change mapreduce temporary working directory /tmp to other folder

I am using Hive and I want to change the MapReduce temporary working directory from /tmp to some other directory. I have tried everything I could find on the internet, but nothing works. I can see with the du -h command that /tmp fills up during the MapReduce task. Can somebody please help me change the directory?
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/data/bd/tmp/hadoop-${user.name}</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/data/bd/tmp/hadoop/dfs/journalnode/</value>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.cluster.local.dir</name>
<value>/data/bd/tmp/mapred/local</value>
</property>
<property>
<name>mapreduce.task.tmp.dir</name>
<value>/data/bd/tmp</value>
</property>
<property>
<name>mapreduce.cluster.temp.dir</name>
<value>/data/bd/tmp/mapred/temp</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/data/bd/tmp/hadoop-yarn/staging</value>
</property>
<property>
<name>mapreduce.jobtracker.system.dir</name>
<value>/data/bd/tmp/mapred/system</value>
</property>
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/data/bd/tmp/mapred/staging</value>
</property>
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
</configuration>
yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,
$HADOOP_YARN_HOME/share/hadoop/yarn/*,$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/data/bd/tmp/logs</value>
<description>The staging dir used while submitting jobs</description>
</property>
<property>
<name>yarn.timeline-service.entity-group-fs-store.active-dir</name>
<value>/data/bd/tmp/entity-file-history/active</value>
<description>HDFS path to store active application’s timeline data</description>
</property>
<property>
<name>yarn.timeline-service.entity-group-fs-store.done-dir</name>
<value>/data/bd/tmp/entity-file-history/done/</value>
<description>HDFS path to store done application’s timeline data</description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/data/bd/tmp/hadoop-ubuntu/nm-local-dir</value>
<description>List of directories to store localized files</description>
</property>
</configuration>
hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password for connecting to mysql server</description>
</property>
<property>
<name>hive.exec.parallel</name>
<value>true</value>
<description>Whether to execute jobs in parallel</description>
</property>
<property>
<name>hive.exec.parallel.thread.number</name>
<value>8</value>
<description>How many jobs at most can be executed in parallel</description>
</property>
<property>
<name>hive.cbo.enable</name>
<value>true</value>
<description>Flag to control enabling Cost Based Optimizations using Calcite framework.</description>
</property>
<property>
<name>hive.compute.query.using.stats</name>
<value>true</value>
<description>
When set to true Hive will answer a few queries like count(1) purely using stats
stored in metastore. For basic stats collection turn on the config hive.stats.autogather to true.
For more advanced stats collection need to run analyze table queries.
</description>
</property>
<property>
<name>hive.stats.fetch.partition.stats</name>
<value>true</value>
<description>
Annotation of operator tree with statistics information requires partition level basic
statistics like number of rows, data size and file size. Partition statistics are fetched from
metastore. Fetching partition statistics for each needed partition can be expensive when the
number of partitions is high. This flag can be used to disable fetching of partition statistics
from metastore. When this flag is disabled, Hive will make calls to filesystem to get file sizes
and will estimate the number of rows from row schema.
</description>
</property>
<property>
<name>hive.stats.fetch.column.stats</name>
<value>true</value>
<description>
Annotation of operator tree with statistics information requires column statistics.
Column statistics are fetched from metastore. Fetching column statistics for each needed column
can be expensive when the number of columns is high. This flag can be used to disable fetching
of column statistics from metastore.
</description>
</property>
<property>
<name>hive.stats.autogather</name>
<value>true</value>
<description>A flag to gather statistics automatically during the INSERT OVERWRITE command.</description>
</property>
<property>
<name>hive.stats.dbclass</name>
<value>fs</value>
<description>
Expects one of the pattern in [jdbc(:.*), hbase, counter, custom, fs].
The storage that stores temporary Hive statistics. In filesystem based statistics collection ('fs'),
each task writes statistics it has collected in a file on the filesystem, which will be aggregated
after the job has finished. Supported values are fs (filesystem), jdbc:database (where database
can be derby, mysql, etc.), hbase, counter, and custom as defined in StatsSetupConst.java.
</description>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/data/bd/tmp</value>
<description>Scratch space for Hive jobs</description>
</property>
<property>
<name>hive.service.metrics.file.location</name>
<value>/data/bd/tmp/report.json</value>
<description>For metric class org.apache.hadoop.hive.common.metrics.metrics2.CodahaleMetrics JSON_FILE reporter, the location of local JSON metrics file. This file will get overwritten at every interval.</description>
</property>
<property>
<name>hive.query.results.cache.directory</name>
<value>/data/bd/tmp/hive/_resultscache_</value>
<description>unknown</description>
</property>
<property>
<name>hive.llap.io.allocator.mmap.path</name>
<value>/data/bd/tmp</value>
<description>unknown</description>
</property>
<property>
<name>hive.hbase.snapshot.restoredir</name>
<value>/data/bd/tmp</value>
<description>unknown</description>
</property>
<property>
<name>hive.druid.working.directory</name>
<value>/data/bd/tmp//workingDirectory</value>
<description>unknown</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>/data/bd/tmp</value>
<description>logs hive</description>
</property>
</configuration>
For Hadoop 2.7.1:
Configure mapreduce.cluster.local.dir in $HADOOP_HOME/etc/hadoop/mapred-site.xml; it also supports a comma-separated list of directories on different devices.
https://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
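For example, spreading the MapReduce local directory across two disks (the paths here are illustrative) would look like this in mapred-site.xml:
<!-- comma-separated list; each directory must exist and be writable by the job user -->
<property>
<name>mapreduce.cluster.local.dir</name>
<value>/data/disk1/mapred/local,/data/disk2/mapred/local</value>
</property>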

Building hadoop cluster on small nodes

I'm preparing a Hadoop cluster on four very small virtual servers (2 GB RAM, 2 cores each) for a proof of concept.
One server acts as the name node and resource manager, and the other three are data nodes.
Every time I run the test job (a 3.4 GB data file), two of the data nodes (random ones) work at maximum capacity and one of them sits idle (monitored via htop).
All 3 data nodes are visible in the hadoop GUI.
What am I missing?
Any help will be much appreciated.
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop-master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop-master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop-master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop-master:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop-master:8088</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop/hadoop/dfs/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>67108864</value>
</property>
</configuration>
I found the solution.
To increase the number of reducers, I added the following to mapred-site.xml:
<property>
<name>A</name>
<value>5</value>
</property>
After I added additional nodes to the cluster, Hadoop increased the number of mappers without any further change to the configuration. All data nodes are now working at maximum capacity.
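The property name appears only as "A" above; the reducer count is normally set with mapreduce.job.reduces (mapred.reduce.tasks in older releases), so the snippet presumably looked roughly like this:
<!-- assumed property name; the original post only shows "A" -->
<property>
<name>mapreduce.job.reduces</name>
<value>5</value>
</property>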

Hadoop's Capacity Scheduler - Setting up multiple queues

I tried to set up two queues: queue1 and queue2.
I added the names of these queues to mapred-site.xml:
<property>
<name>mapred.queue.names</name>
<value>queue1,queue2</value>
</property>
I configured CapacityScheduler.xml as shown below.
<?xml version="1.0"?>
<configuration>
<property>
<name>mapred.capacity-scheduler.maximum-system-jobs</name>
<value>3000</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.capacity</name>
<value>100</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.capacity</name>
<value>100</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.maximum-capacity</name>
<value>-1</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.maximum-capacity</name>
<value>-1</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.supports-priority</name>
<value>false</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.supports-priority</name>
<value>false</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.minimum-user-limit-percent</name>
<value>100</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.minimum-user-limit-percent</name>
<value>100</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.maximum-initialized-active-tasks</name>
<value>200000</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.maximum-initialized-active-tasks</name>
<value>200000</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.maximum-initialized-active-tasks-per-user</name>
<value>100000</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.maximum-initialized-active-tasks-per-user</name>
<value>100000</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.init-accept-jobs-factor</name>
<value>10</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.init-accept-jobs-factor</name>
<value>10</value>
</property>
<property>
<name>mapred.capacity-scheduler.default-supports-priority</name>
<value>false</value>
</property>
<property>
<name>mapred.capacity-scheduler.default-minimum-user-limit-percent</name>
<value>100</value>
</property>
<property>
<name>mapred.capacity-scheduler.default-user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>mapred.capacity-scheduler.default-maximum-active-tasks-per-queue</name>
<value>200000</value>
</property>
<property>
<name>mapred.capacity-scheduler.default-maximum-active-tasks-per-user</name>
<value>100000</value>
</property>
<property>
<name>mapred.capacity-scheduler.default-init-accept-jobs-factor</name>
<value>10</value>
</property>
<!-- Capacity scheduler Job Initialization configuration parameters -->
<property>
<name>mapred.capacity-scheduler.init-poll-interval</name>
<value>5000</value>
</property>
<property>
<name>mapred.capacity-scheduler.init-worker-threads</name>
<value>5</value>
</property>
</configuration>
The bin/start-all.sh starts the following services.
17083 DataNode
17557 TaskTracker
17373 JobTracker
16902 NameNode
17279 SecondaryNameNode
17703 Jps
I'm able to view the JobTracker's web UI at
http://localhost:50030/
but the TaskTracker's web UI at
http://localhost:50060/
shows "Unable to Connect". And after a few seconds the JobTracker and TaskTracker shut down; the jps command on the terminal then only shows
17083 DataNode
16902 NameNode
17279 SecondaryNameNode
17703 Jps
What might be the solution?
Both of your queues have a capacity of 100, which makes the capacity scheduler think there are two queues that each take 100% of the cluster. I suggest you change the settings to:
<?xml version="1.0"?>
<configuration>
<property>
<name>mapred.capacity-scheduler.maximum-system-jobs</name>
<value>3000</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.capacity</name>
<value>80</value> <!-- change here -->
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.capacity</name>
<value>20</value> <!-- change here -->
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.maximum-capacity</name>
<value>-1</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.maximum-capacity</name>
<value>-1</value>
</property>
The sum of all your queue capacities must always be exactly 100 (i.e. 100%); you can have two queues with 100 and 0 percent respectively - that is valid.
Also, I think it's good practice to always have a "default" queue with at least some allocation. I don't know what the scheduler will do if you don't specify a queue name when there is no default.
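A sketch of what that could look like with a default queue included (the exact capacities are illustrative, but they still have to sum to 100):
<!-- mapred-site.xml: register the default queue alongside the others -->
<property>
<name>mapred.queue.names</name>
<value>default,queue1,queue2</value>
</property>
<!-- CapacityScheduler.xml: queue capacities summing to 100 -->
<property>
<name>mapred.capacity-scheduler.queue.default.capacity</name>
<value>10</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue1.capacity</name>
<value>70</value>
</property>
<property>
<name>mapred.capacity-scheduler.queue.queue2.capacity</name>
<value>20</value>
</property>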
