how to increase capacity of hdfs in Hadoop 2.x

how to increase capacity of hdfs in Hadoop 2.x - hadoop

I've been trying to find how to increase capacity of hdfs in Hadoop 2.7.2 with spark 2.0.0.
I read this link.
But I don't understand it. Here is my core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>hadoop_eco/hadoop/tmp</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://com1:9000</value>
</property>
</configuration>
and hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>hadoop_eco/hadoop/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>hadoop_eco/hadoop/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
When I run spark with 1 namenode and 10 datanodes, I got this error message:
org.apache.hadoop.hdfs.StateChange: DIR* completeFile:
/user/spark/_temporary/0/_temporary/attempt_201611141313_0001_m_000052_574/part-00052
is closed by DFSClient_NONMAPREDUCE_1638755846_140
I couldn't identify this error, but it may related to lack of disk capacity.
My configured capacity (hdfs) is 499.76GB and each datanode's capacity is 49.98GB.
So, is there a method to increase capacity of hdfs?

I solved it.
It is so easy to change the capacity of hdfs.
I tried to change hdfs-site.xml
<property>
<name>dfs.datanode.data.dir</name>
<value>file://"your directory path"</value>
</property>
and use this command line
hadoop namenode -format
stop-all.sh
start-all.sh
finally check your capacity of hdfs using hdfs dfsadmin -report

Related

hdfs file is not distributed

I am new to Hadoop and I'm going to configure the hadoop cluster. The Version of Hadoop is 3.1.3. I want to set the NameNode, DataNode, NodeManager on host hadoop102, DataNode, ResourceNode, NodeManager on host hadoop103, and SecondaryNameNode, DataNode, NodeManager on hadoop104
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop102:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop-3.1.3/data</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop102:9870</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop104:9868</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop103</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
workers
hadoop102
hadoop103
hadoop104
I upload the test file from host hadoop102 with the command
hadoop fs -put $HADOOP_HOME/wcinput/word.txt /input
Why the file is only available on hadoop102? I think the file should be copied into hadoop103, hadoop104 in the local file system.
File Information

You need to know that HDFS is not like replicated file system, so if you put one file to HDFS does not mean that it will be placed on data nodes as files (under / filesystem for example).
HDFS splits the file into blocks, and these blocks are replicated on your cluster and configured by replication factor.
When you run -copyFromLocal or hdfs put what does perform is just split the file into blocks and send these blocks in replicated fashion.
So if one node goes down. you can still retrieve your file.
But where's my file? the file will not be in your machines' local filesystem. It will be stored on data nodes.
How can you configure the number of replicas?
You can setup dfs.replication to 3 in hdfs-site.xml
and you set number of replica for a file:
hadoop fs –setrep –w 3 /my/file
You can change the replication factor of all the files under a directory.
hadoop fs –setrep –w 3 -R /my/dir

Hadoop 3.2.1 Multinode Cluster Nodemanager is not running

I have Hadoop 3.2.1 installed on Ubuntu 16.04lts and my cluster has 18 datanodes and 1 master.
After running:
$ start-dfs.sh
$ start-yarn.sh
$ jps
On master I get the following:
ResourceManager
NameNode
SecondaryNameNodecode
jps
And on datanodes:
DataNode
jps
All the nodes seems to be live:
NameNode Overview Web Page
But when I reach the Cluster overview, none of my datanodes seems to be active:
Cluster Overview
My configurations files:
core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadoop-3.2.1/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/hadoop-3.2.1/data/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/hadoop-3.2.1/data/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
The namenode and datanode directories exists on every host (master and datanodes)
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services </name>
<value> mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
</configuration>
Also I have configured hadoop-env.sh for JAVA_HOME Path and all the other variables are in .bashrc file (also in every host).
I have modified the /etc/hosts file to include all the hosts with their IPs and hostnames and finally I have also modified the workers file to include all the IPs of the datanodes.
The first time I have formatted the NameNode, the directories for the hdfs-site.xml was wrong (I had the datanode dir twice), so hdfs make its own directories under /tmp/hdfs/ (if I remember correctly). But I fixed this with formating again the NameNode with the corect directories.

Pseudo Distributed Mode Hadoop

I have installed Pseudo Distributed mode Hadoop 2.7.3 in Mac & did all Configuration which is specified in Plural Sight. I Copied Csv file from Local to hdfs. But next day when i searched for files , it is not present in hdfs and removed automatically. Is there any other conf setting so that my files are not loss?
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Thanks,

Add these properties to hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/username/hadoop-dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/username/hadoop-dfs/data</value>
</property>
The metadata and data blocks are stored under /tmp by default as it is the value of hadoop.tmp.dir. The contents inside /tmp are deleted on reboot.
After adding these properties, format the namenode and start the services.

Cannot create directory /home/hadoop/hadoopinfra/hdfs/namenode/current

I get the error
Cannot create directory /home/hadoop/hadoopinfra/hdfs/namenode/current
While trying to install hadoop on my local Mac.
What could be the reason for this? Just for reference, I'm putting my xml files down below:
mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode </value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode </value>
</property>
</configuration>
core-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
I think my problem lies in my hdfs-site.xml file, but I'm not sure how to pinpoint/change it.
I'm using this tutorial, and "hadoop" in the file path is replaced by my username.

Possible error: misconfiguration of the hdfs-site.xml file
This happened to me when I was following a setup tutorial. The contents of the hdfs-site.xml for me was
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Only then I realized that the text hadoop in the above file corresponds to the user name, where in my case, it had to replaced with hduser. When both occurrences of hadoop was replaced with hduser, the hdfs namenode -format command worked fine.

I had this problem too and it was a permission problem. I just did:
sudo chmod 777 /home/hadoop/hadoopinfra/hdfs/namenode/
and works!

In the step where you need to verify the hadoop installation, instead of 'hdfs namenode -format' use '/usr/local/hadoop/bin/hdfs namenode -format'
Found this answer from:
hadoop java.io.IOException: while running namenode -format

If you are not using any other distro than native hadoop, then add the current user to hadoop group and retry formatting the namenode.
sudo usermod -a -G hadoop <current-username>
In case of using thirdparty hadoop distros such Cloudera, Hortonworks or MapR, switch to root user and again switch to hdfs user then try formatting the namenode will succeed.
$ sudo -i
$ su - hdfs
$ hdfs namenode -format

Try the Hadoop command with sudo

MapReduce job hangs, waiting for AM container to be allocated

I tried to run simple word count as MapReduce job. Everything works fine when run locally (all work done on Name Node). But, when I try to run it on a cluster using YARN (adding mapreduce.framework.name=yarn to mapred-site.conf) job hangs.
I came across a similar problem here:
MapReduce jobs get stuck in Accepted state
Output from job:
*** START ***
15/12/25 17:52:50 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/12/25 17:52:51 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/12/25 17:52:51 INFO input.FileInputFormat: Total input paths to process : 5
15/12/25 17:52:52 INFO mapreduce.JobSubmitter: number of splits:5
15/12/25 17:52:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1451083949804_0001
15/12/25 17:52:53 INFO impl.YarnClientImpl: Submitted application application_1451083949804_0001
15/12/25 17:52:53 INFO mapreduce.Job: The url to track the job: http://hadoop-droplet:8088/proxy/application_1451083949804_0001/
15/12/25 17:52:53 INFO mapreduce.Job: Running job: job_1451083949804_0001
mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.job.tracker</name>
<value>localhost:54311</value>
</property>
<!--
<property>
<name>mapreduce.job.tracker.reserved.physicalmemory.mb</name>
<value></value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>3000</value>
<source>mapred-site.xml</source>
</property> -->
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<!--
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3000</value>
<source>yarn-site.xml</source>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>500</value>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>3000</value>
</property>
-->
</configuration>
//I the left commented options - they were not solving the problem
YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register with RM.
What can be the problem?
EDIT:
I tried this configuration (commented) on machines: NameNode(8GB RAM) + 2x DataNode (4GB RAM). I get the same effect: Job hangs on ACCEPTED state.
EDIT2:
changed configuration (thanks #Manjunath Ballur) to:
yarn-site.xml:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-droplet</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop-droplet:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop-droplet:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop-droplet:8030</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop-droplet:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop-droplet:8088</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$YARN_HOME/*,$YARN_HOME/lib/*
</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/data/1/yarn/local,/data/2/yarn/local,/data/3/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/data/1/yarn/logs,/data/2/yarn/logs,/data/3/yarn/logs</value>
</property>
<property>
<description>Where to aggregate logs</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/var/log/hadoop-yarn/apps</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>50</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>390</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>390</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>50</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx40m</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>50</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>50</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx40m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx40m</value>
</property>
</configuration>
Still not working.
Additional info: I can see no nodes on cluster preview (similar problem here: Slave nodes not in Yarn ResourceManager )

You should check the status of Node managers in your cluster. If the NM nodes are short on disk space then RM will mark them "unhealthy" and those NMs can't allocate new containers.
1) Check the Unhealthy nodes: http://<active_RM>:8088/cluster/nodes/unhealthy
If the "health report" tab says "local-dirs are bad" then it means you need to cleanup some disk space from these nodes.
2) Check the DFS dfs.data.dir property in hdfs-site.xml. It points the location on local file system where hdfs data is stored.
3) Login to those machines and use df -h & hadoop fs - du -h commands to measure the space occupied.
4) Verify hadoop trash and delete it if it's blocking you.
hadoop fs -du -h /user/user_name/.Trash and hadoop fs -rm -r /user/user_name/.Trash/*

I feel, you are getting your memory settings wrong.
To understand the tuning of YARN configuration, I found this to be a very good source: http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_yarn_tuning.html
I followed the instructions given in this blog and was able to get my jobs running. You should alter your settings proportional to the physical memory you have on your nodes.
Key things to remember is:
Values of mapreduce.map.memory.mb and mapreduce.reduce.memory.mb should be at least yarn.scheduler.minimum-allocation-mb
Values of mapreduce.map.java.opts and mapreduce.reduce.java.opts should be around "0.8 times the value of" corresponding mapreduce.map.memory.mb and mapreduce.reduce.memory.mb configurations. (In my case it is 983 MB ~ (0.8 * 1228 MB))
Similarly, value of yarn.app.mapreduce.am.command-opts should be "0.8 times the value of" yarn.app.mapreduce.am.resource.mb
Following are the settings I use and they work perfectly for me:
yarn-site.xml:
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1228</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>9830</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>9830</value>
</property>
mapred-site.xml
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>1228</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx983m</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1228</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1228</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx983m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx983m</value>
</property>
You can also refer to the answer here: Yarn container understanding and tuning
You can add vCore settings, if you want your container allocation to take into account CPU also. But, for this to work, you need to use CapacityScheduler with DominantResourceCalculator. See the discussion about this here: How are containers created based on vcores and memory in MapReduce2?

This has solved my case for this error:
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>100</value>
</property>

Check your hosts file on master and slave nodes. I had exactly this problem. My hosts file looked like this on master node for example
127.0.0.0 localhost
127.0.1.1 master-virtualbox
192.168.15.101 master
I changed it like below
192.168.15.101 master master-virtualbox localhost
So it worked.

These lines
<property>
<name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
<value>100</value>
</property>
in the yarn-site.xml solved my problem since the node will be marked as unhealthy when disk usage is >=95%. Solution mainly suitable for pseudodistributed mode.

You have 512 MB RAM on each of the instance and all your memory configurations in yarn-site.xml and mapred-site.xml are 500 MB to 3 GB. You will not be able to run any thing on the cluster. Change every thing to ~256 MB.
Also your mapred-site.xml is using framework to by yarn and you have job tracker address which is not correct. You need to have resource manager related parameters in yarn-site.xml on a multinode cluster (including resourcemanager web address). With out that, the cluster does not know where your cluster is.
You need to revisit both your xml files.

anyway that's work for me .thank you a lot! #KaP
that's my yarn-site.xml
<property>
<name>yarn.resourcemanager.hostname</name>
<value>MacdeMacBook-Pro.local</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
that's my mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

The first thing is to check yarn resource manager logs. I had searched the Internet about this problem for a very long time, but nobody told me how to find out what is really happening. It's so straightforward and simple to check yarn resource manager logs. I am confused why people ignore logs.
For me, there was a error in log
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=172.16.0.167/172.16.0.167:55622]
That's because I switched wifi network in my work place, so my computer IP changed.

Old question, but I got on the same issue recently and in my case it was due to manually setting the master to local in the code.
Please, search for conf.setMaster("local[*]") and remove it.
Hope it helps.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

how to increase capacity of hdfs in Hadoop 2.x - hadoop

Related

hdfs file is not distributed

Hadoop 3.2.1 Multinode Cluster Nodemanager is not running

Pseudo Distributed Mode Hadoop

Cannot create directory /home/hadoop/hadoopinfra/hdfs/namenode/current

MapReduce job hangs, waiting for AM container to be allocated

Categories

Resources