When is hdfs-site.xml loaded by hadoop? - hadoop

I have hive and hadoop installed in my system.
This is my hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
If I run bin/start-all.sh, then go to Hive and run a SELECT query, I get this error:
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
If I wait for some time and run the Hive query again, it works.
I read that the safe mode threshold is set using the property dfs.namenode.safemode.threshold-pct, so I added that property to my hdfs-site.xml:
<property>
<name>dfs.namenode.safemode.threshold-pct</name>
<value>0.500f</value>
</property>
I restarted all the Hadoop daemons and ran the Hive query again, but I was still getting the same error:
The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will
This means that either my XML is wrong, or I have to do something else to make Hadoop actually load hdfs-site.xml.
Can someone tell me what I am doing wrong?
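A quick way to check whether hdfs-site.xml is being picked up at all is to ask Hadoop which value it actually resolved; on Hadoop 2.x and later something like this should work:
echo $HADOOP_CONF_DIR
hdfs getconf -confKey dfs.namenode.safemode.threshold-pct
The first command shows which configuration directory is being read (empty usually means the default conf directory under HADOOP_HOME); the second prints the effective value of the property as loaded from the *-site.xml files.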

I was making a mistake. I went and checked hdfs-default.xml in the src folder and found this:
<property>
<name>dfs.safemode.threshold.pct</name>
<value>0.999f</value>
<description>
Specifies the percentage of blocks that should satisfy
the minimal replication requirement defined by dfs.replication.min.
Values less than or equal to 0 mean not to start in safe mode.
Values greater than 1 will make safe mode permanent.
</description>
</property>
I think I am using an old version of Hadoop, since dfs.safemode.threshold.pct is the old, now-deprecated name of the property (newer versions call it dfs.namenode.safemode.threshold-pct).
I modified my hdfs-site.xml, then stopped and restarted the NameNode:
<property>
<name>dfs.safemode.threshold.pct</name>
<value>2</value>
</property>
and it worked!
The ratio of reported blocks 0.0000 has not reached the threshold 2.0000. Safe mode will be turned off automatically.
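As a side note, if you only need to get past safe mode once, you can check and clear it from the command line (on very old releases the command is hadoop dfsadmin instead of hdfs dfsadmin):
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave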

Related

Hadoop localhost:9870 browser interface is not working

I need to do data analysis using Hadoop, therefore I have installed Hadoop and configured it as below. But localhost:9870 is not working. I have even formatted the NameNode every time I worked with it. Some articles and answers on this forum mention that 9870 is the updated port that replaced 50070. I am on Windows 10. I also referred to answers on this forum, but none of them worked. The JAVA_HOME and HADOOP_HOME paths are set, and the paths to Hadoop's bin and sbin are set as well. Can anyone please tell me what I am doing wrong here?
I referred to this site for the installation and configuration:
https://medium.com/#pedro.a.hdez.a/hadoop-3-2-2-installation-guide-for-windows-10-454f5b5c22d3
core-site.xml
I have set up the Java path in this xml as well.
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9870</value>
</property>
hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-3.2.2\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-3.2.2\data\datanode</value>
</property>
mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
If you look at the namenode logs, it very likely has an error saying something about a port already being in use.
The default fs.defaultFS port should be 9000 - https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html ; you shouldn't change this without good reason.
The NameNode web UI is not the value in fs.defaultFS. Its default port is 9870, and it is defined by dfs.namenode.http-address in hdfs-site.xml.
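As a sketch of that separation (the property names are standard; 9000 and 9870 are the usual defaults, adjust hosts and ports to your setup):
In core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
In hdfs-site.xml (9870 is already the default NameNode HTTP port in Hadoop 3.x, so this is only needed if you want to change or pin it):
<property>
<name>dfs.namenode.http-address</name>
<value>0.0.0.0:9870</value>
</property>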
need to do data analysis
You can do analysis on Windows without Hadoop, using Spark, Hive, MapReduce, etc. directly, and they will have direct access to your machine without being limited by YARN container sizes.

ERROR datanode.DataNode: Exception in secureMain

I was trying to install Hadoop on Windows.
The NameNode is working fine, but the DataNode is not. The following error is displayed again and again, even after trying several times.
This is the error shown in CMD for the DataNode:
2021-12-16 20:24:32,624 INFO checker.ThrottledAsyncChecker: Scheduling a check for [DISK]file:/C:/Users/mtalha.umair/datanode
2021-12-16 20:24:32,624 ERROR datanode.DataNode: Exception in secureMain
org.apache.hadoop.util.DiskChecker$DiskErrorException: Invalid value configured for dfs.datanode.failed.volumes.tolerated - 1. Value configured is >= to the number of configured volumes (1).
at org.apache.hadoop.hdfs.server.datanode.checker.StorageLocationChecker.check(StorageLocationChecker.java:176)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2799)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2714)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2756)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2900)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2924)
2021-12-16 20:24:32,640 INFO util.ExitUtil: Exiting with status 1: org.apache.hadoop.util.DiskChecker$DiskErrorException: Invalid value configured for dfs.datanode.failed.volumes.tolerated - 1. Value configured is >= to the number of configured volumes (1).
2021-12-16 20:24:32,640 INFO datanode.DataNode: SHUTDOWN_MSG:
I have referred to many different articles, but to no avail. I have tried another version of Hadoop, but the problem remains. As I am just starting out, I can't fully understand the problem, so I need help.
These are my configurations:
-For core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
-For mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
-For yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
-For hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/D:/big-data/hadoop-3.1.3/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>datanode</value>
</property>
<property>
<name>dfs.datanode.failed.volumes.tolerated</name>
<value>1</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Well unfortunately the reason this is failing is exactly what the message says. Let me try to say it another way.
dfs.datanode.failed.volumes.tolerated = 1
The number of data folders (dfs.datanode.data.dir) you have configured is 1.
You are saying you will tolerate your only data drive failing (1 drive configured, and you'll tolerate it breaking), which would leave the DataNode with no storage at all. This does not make sense, and that is why it is being raised as an issue.
You need to alter the settings so there is a gap of at least 1 between the two values (so that you can still have a running DataNode).
Here are your options:
Configure more data volumes (2), with dfs.datanode.failed.volumes.tolerated set to 1. For example, store data on both your C: and D: drives.
Set dfs.datanode.failed.volumes.tolerated to 0 and keep your data volumes as they are (1); a sketch of this option follows below.
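A minimal sketch of that second option; the data directory path here is illustrative (mirroring the NameNode directory above), so keep whatever single directory you actually use:
<property>
<name>dfs.datanode.data.dir</name>
<value>/D:/big-data/hadoop-3.1.3/data/datanode</value>
</property>
<property>
<name>dfs.datanode.failed.volumes.tolerated</name>
<value>0</value>
</property>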

Giraph job always running in local mode

I ran Giraph 1.1.0 on Hadoop 2.6.0.
The mapred-site.xml looks like this:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>The runtime framework for executing MapReduce jobs. Can be one of
local, classic or yarn.</description>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx3072m</value>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx6144m</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>4</value>
</property>
</configuration>
The giraph-site.xml looks like this
<configuration>
<property>
<name>giraph.SplitMasterWorker</name>
<value>true</value>
</property>
<property>
<name>giraph.logLevel</name>
<value>error</value>
</property>
</configuration>
I do not want to run the job in local mode. I have also set the environment variable MAPRED_HOME to HADOOP_HOME. This is the command to run the program:
hadoop jar myjar.jar hu.elte.inf.mbalassi.msc.giraph.betweenness.BetweennessComputation /user/$USER/inputbc/inputgraph.txt /user/$USER/outputBC 1.0 1
When I run this code that computes betweenness centrality of vertices in a graph, I get the following exception
Exception in thread "main" java.lang.IllegalArgumentException: checkLocalJobRunnerConfiguration: When using LocalJobRunner, you cannot run in split master / worker mode since there is only 1 task at a time!
at org.apache.giraph.job.GiraphJob.checkLocalJobRunnerConfiguration(GiraphJob.java:168)
at org.apache.giraph.job.GiraphJob.run(GiraphJob.java:236)
at hu.elte.inf.mbalassi.msc.giraph.betweenness.BetweennessComputation.runMain(BetweennessComputation.java:214)
at hu.elte.inf.mbalassi.msc.giraph.betweenness.BetweennessComputation.main(BetweennessComputation.java:218)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
What should I do to ensure that the job does not run in local mode?
I ran into this problem just a few days ago. Fortunately, I solved it by doing the following.
Modify the configuration file mapred-site.xml: make sure the value of the property 'mapreduce.framework.name' is 'yarn', and add the property 'mapreduce.jobtracker.address' with the value 'yarn' if it is not there.
The mapred-site.xml looks like this:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>yarn</value>
</property>
</configuration>
Restart Hadoop after modifying mapred-site.xml; a rough sketch of the restart commands is below. Then run your program with the value after '-w' set to more than 1 and 'giraph.SplitMasterWorker' set to 'true'. It will probably work.
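Assuming the standard scripts under $HADOOP_HOME/sbin, the restart would look something like:
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh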
As for the cause of the problem, I will just quote what somebody else said:
These properties are designed for single-node executions and will have to be
changed when executing things in a cluster of nodes. In such a situation, the
jobtracker has to point to one of the machines that will be executing a
NodeManager daemon (a Hadoop slave). As for the framework, it should be
changed to 'yarn'.
We can see in the stack trace that the configuration check in LocalJobRunner fails. This is a bit misleading, because it makes us assume that we are running in local mode. You already found the responsible configuration option, giraph.SplitMasterWorker, but in your case you set it to true. However, on the command line, with the last parameter 1 you specify that only a single worker should be used. Hence the framework decides that you MUST be running in local mode. As a solution you have two options:
Set giraph.SplitMasterWorker to false although you are running on a cluster.
Increase the number of workers by changing the last parameter to the command-line call.
hadoop jar myjar.jar hu.elte.inf.mbalassi.msc.giraph.betweenness.BetweennessComputation /user/$USER/inputbc/inputgraph.txt /user/$USER/outputBC 1.0 4
Please refer also to my other answer at SO (Apache Giraph master / worker mode) for details on the problem concerning local mode.
If you want to split the master from the worker, you can use:
-ca giraph.SplitMasterWorker=true
Also, to specify the number of workers, you can use:
-w #
where "#" is the number of workers you want to use.

ResourceManager got stuck in Accepted state

I am trying to integrate my Elasticsearch 2.2.0 with Hadoop HDFS. In my environment I have 1 master node and 1 data node, and Elasticsearch is installed on my master node.
But while integrating it with HDFS, my ResourceManager application jobs get stuck in the Accepted state.
Somehow I found a link suggesting that I change my yarn-site.xml settings:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2200</value>
<description>Amount of physical memory, in MB, that can be allocated for containers.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>500</value>
</property>
I have done this as well, but it is not giving me the expected output.
Configuration:
my core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
my mapred-site.xml,
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description>
</property>
my hdfs-site.xml,
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
Please help me: how can I get my RM job into the RUNNING state, so that I can use my Elasticsearch data on HDFS?
If the screenshot is correct, you have 0 NodeManagers, so the application can't start running. You need to start at least 1 NodeManager so that the ApplicationMaster, and later the tasks, can be started.
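To see what the ResourceManager sees and to bring a NodeManager up from the command line, something like this should work (the start script differs between versions; Hadoop 3.x uses yarn --daemon start nodemanager instead of the sbin script):
yarn node -list
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager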

mahout ssvd job performance

I need to compute an SSVD.
For a 50,000 x 50,000 matrix, when reducing to 300x300, libraries such as ssvdlibc and others can compute it in less than 3 minutes.
I wanted to do it for big data and tried using Mahout. First I tried to run it locally on my small data set (that is, 50,000 x 50,000), but it takes 32 minutes to complete that simple job, uses around 5.5 GB of disk space for spill files, and causes my Intel i5 with 8 GiB RAM and an SSD drive to freeze a few times.
I understand that Mahout and Hadoop must do lots of additional steps to perform everything as a map-reduce job, but the performance hit just seems too big. I think I must have something wrong in my setup.
I've read some Hadoop and Mahout documentation and added a few parameters to my config files, but it's still incredibly slow. Most of the time it uses only one CPU.
Can someone please tell me what is wrong with my setup? Can it somehow be tuned for this simple, one-machine use, just to see what to look for in a bigger deployment?
My config files:
mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>local</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx5000M</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>3</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>3</value>
</property>
<property>
<name>io.sort.factor</name>
<value>35</value>
</property>
</configuration>
core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>file:///</value>
</property>
<!--
<property>
<name>fs.inmemory.size.mb</name>
<value>200</value>
</property>
<property>
<name>io.sort.factor</name>
<value>100</value>
</property>
-->
<property>
<name>io.sort.mb</name>
<value>200</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
I run my job like this:
mahout ssvd --rank 400 --computeU true --computeV true --reduceTasks 3 --input ${INPUT} --output ${OUTPUT} -ow --tempDir /tmp/ssvdtmp/
I also configured Hadoop and Mahout with -Xmx=4000m.
Well, first of all I would verify that it is running in parallel, make sure HDFS replication is set to "1", and generally check your params. That only one core is being used is definitely an issue!
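For reference, on a single machine that replication setting in hdfs-site.xml is just the snippet below (it only matters if you actually run against HDFS rather than the local file:/// filesystem configured above):
<property>
<name>dfs.replication</name>
<value>1</value>
</property>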
But!
The slowness is probably not going to go away completely. You might be able to speed it up significantly with proper configuration, but at the end of the day the Hadoop model is not going to outcompete optimized shared-memory libraries on a single computer.
The power of Hadoop/Mahout is in big data, and honestly 50k x 50k is still in the realm of fairly small, easily manageable on a single computer. Essentially, Hadoop trades speed for scalability. So while it might not outcompete those other libraries at 50,000 x 50,000, try to get them to work on 300,000 x 300,000, while with Hadoop you sit pretty on a distributed cluster.
