Change block size of existing files in Hadoop

Consider a Hadoop cluster where the default block size in hdfs-site.xml is 64 MB. Later on, the team decides to change this to 128 MB. Here are my questions for this scenario:
Will this change require a restart of the cluster, or will it be picked up automatically so that all new files get the default block size of 128 MB?
What will happen to the existing files that have a block size of 64 MB? Will the configuration change apply to existing files automatically? If it is applied automatically, when will this happen: as soon as the change is made, or when the cluster is restarted? If it is not applied automatically, how can the block size of existing files be changed manually?

Will this change require a restart of the cluster, or will it be picked up automatically so that all new files get the default block size of 128 MB?
A restart of the cluster will be required for this property change to take effect.
What will happen to the existing files which have a block size of 64 MB? Will the configuration change apply to existing files automatically?
Existing files will not change their block size; the new default only applies to files written after the change.
If it is not applied automatically, how can the block size of existing files be changed manually?
To change the existing files you can use distcp. It will copy the files over with the new block size. However, you will have to manually delete the old files with the older block size. Here's a command you can use:
hadoop distcp -Ddfs.block.size=XX /path/to/old/files /path/to/new/files/with/larger/block/sizes
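For example, assuming a target block size of 128 MB (134217728 bytes) and placeholder paths, the whole sequence might look like the sketch below; hadoop fsck lets you confirm the new block size before deleting the old copy:
hadoop distcp -Ddfs.block.size=134217728 /data/old /data/new
hadoop fsck /data/new -files -blocks   # listed block lengths should now be up to 134217728
hadoop fs -rmr /data/old               # remove the old copy once the new one is verified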

As mentioned here, for your points:
Whenever you change a configuration, you need to restart the NameNode and DataNodes in order for them to change their behavior.
No, it will not. Existing files keep the old block size. In order for them to pick up the new block size, you need to rewrite the data. You can either do a hadoop fs -cp or a distcp on your data. The new copy will have the new block size, and you can delete your old data.
Check the link for more information.

On point 1: on Hadoop 1.2.1, a restart is not required after changing dfs.block.size in the hdfs-site.xml file. The block size of a file can easily be verified on the Hadoop administration page at http://namenode:50070/dfshealth.jsp
Make sure to change dfs.block.size on all the datanodes.
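For reference, the cluster-wide default is the dfs.block.size property in hdfs-site.xml (named dfs.blocksize on newer releases); a 128 MB setting would look like this:
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>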

Related

How to successfully complete a namenode restart with 5TB worth of edit files to process

I have a namenode that had to be brought down for an emergency, that has not had an FSImage taken for 9 months, and that has about 5TB worth of edits files to process in its next restart. The secondary namenode has not been running (or performed any checkpoint operations) for about 9 months, hence the 9-month-old FSImage.
There are about 7.8 million inodes in the HDFS cluster. The machine has about 260GB of total memory.
We've tried a few different combinations of Java heap size, GC algorithms, etc., but have not been able to find a combination that allows the restart to complete without eventually slowing to a crawl due to full GCs.
I have 2 questions:
1. Has anyone found a namenode configuration that allows this large of an edit file backlog to complete successfully?
2. An alternate approach I've considered is restarting the namenode with only a manageable subset of the edits files present. Once the namenode comes up and creates a new FSImage, bring it down, copy the next subset of edits files over, and then restart it. Repeat until it has processed the entire set of edits files. Would this approach work? Is it safe to do, in terms of the overall stability of the system and the file system?
We were able to get through the 5TB backlog of edits files using a version of what I suggested in question (2) of my original post. Here is the process we went through:
Solution:
1. Make sure that the namenode is "isolated" from the datanodes. This can be done either by shutting down the datanodes or by removing them from the slaves list while the namenode is offline. This keeps the namenode from communicating with the datanodes before the entire backlog of edits files has been processed.
2. Move the entire set of edits files to a location outside of what is configured in the dfs.namenode.name.dir property of the namenode's hdfs-site.xml file.
3. Move (or copy, if you would like to maintain a backup) the next subset of edits files to be processed to the dfs.namenode.name.dir location. If you are not familiar with the naming convention for the FSImage and edits files, take a look at the example below. It will hopefully clarify what is meant by "next subset of edits files".
4. Update the file seen_txid to contain the value of the last transaction represented by the last edits file from the subset you copied over in step (3). So if the last edits file is edits_0000000000000000011-0000000000000000020, you would want to update the value of seen_txid to 20. This essentially fools the namenode into thinking this subset is the entire set of edits files.
5. Start up the namenode. If you take a look at the Startup Progress tab of the HDFS Web UI, you will see that the namenode starts with the latest FSImage present, processes the edits files present, creates a new FSImage file, and then goes into safemode while it waits for the datanodes to come online.
6. Bring down the namenode.
7. There will be an edits_inprogress_######## file created as a placeholder by the namenode. Unless this is the final set of edits files to process, delete this file.
8. Repeat steps 3-7 until you've worked through the entire backlog of edits files.
9. Bring up the datanodes. The namenode should get out of safemode once it has been able to confirm the location of a number of data blocks.
10. Set up a secondary namenode, or high availability for your cluster, so that an FSImage will periodically be created from now on.
Example:
Let's say we have FSImage fsimage_0000000000000000010 and a bunch of edits files:
edits_0000000000000000011-0000000000000000020
edits_0000000000000000021-0000000000000000030
edits_0000000000000000031-0000000000000000040
edits_0000000000000000041-0000000000000000050
edits_0000000000000000051-0000000000000000060
...
edits_0000000000000000091-0000000000000000100
Following the steps outlined above:
1. All datanodes brought offline.
2. All edits files copied from dfs.namenode.name.dir to another location, e.g. /tmp/backup.
3. Let's process 2 files at a time. So copy edits_0000000000000000011-0000000000000000020 and edits_0000000000000000021-0000000000000000030 over to the dfs.namenode.name.dir location.
4. Update seen_txid to contain a value of 30, since this is the last transaction we will be processing during this run.
5. Start up the namenode, and confirm through the HDFS Web UI's Startup Progress tab that it correctly used fsimage_0000000000000000010 as a starting point and then processed edits_0000000000000000011-0000000000000000020 and edits_0000000000000000021-0000000000000000030. It then created a new FSImage file fsimage_0000000000000000030 and entered safemode, waiting for the datanodes to come up.
6. Bring down the namenode.
7. Delete the placeholder file edits_inprogress_######## since this is not the final set of edits files to be processed.
8. Proceed with the next run and repeat until all edits files have been processed.
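As a rough sketch of one such run on the namenode host, assuming dfs.namenode.name.dir points to /data/namenode (a hypothetical path) and the backed-up edits live in /tmp/backup:
cp /tmp/backup/edits_0000000000000000011-0000000000000000020 /data/namenode/current/
cp /tmp/backup/edits_0000000000000000021-0000000000000000030 /data/namenode/current/
echo 30 > /data/namenode/current/seen_txid     # last transaction in this subset
hadoop-daemon.sh start namenode                # watch the Startup Progress tab until the new FSImage is written
hadoop-daemon.sh stop namenode
rm /data/namenode/current/edits_inprogress_*   # skip this on the final run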
If your Hadoop cluster is HA-enabled, the standby NameNode should have taken care of this; in a non-HA setup, it is the secondary NameNode's job.
Check the logs of these namenode processes to see why the merge is not happening or is failing.
The parameters below control how your edits are checkpointed; with checkpointing working correctly, it should not have created this many edits files.
dfs.namenode.checkpoint.period
dfs.namenode.checkpoint.txns
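Both are set in hdfs-site.xml; the values below are the usual defaults (checkpoint every hour or every one million transactions), shown only for reference:
<property>
<name>dfs.namenode.checkpoint.period</name>
<value>3600</value>
</property>
<property>
<name>dfs.namenode.checkpoint.txns</name>
<value>1000000</value>
</property>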
Another way is to manually perform the merge, but this would only be a temporary fix:
hdfs dfsadmin -safemode enter
hdfs dfsadmin -rollEdits
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave
Running the above commands should merge the edits and save the namespace.

Can HDFS block size be changed during job run? Custom Split and Variant Size

I am using Hadoop 1.0.3. Can the input split/block size be changed (increased/decreased) at run time based on some constraints? Is there a class to override to accomplish this, like FileSplit/TextInputFormat? Can we have variable-size blocks in HDFS depending on a logical constraint within one job?
You're not limited to TextInputFormat... that's entirely configurable based on the data source you are reading. Most examples are line-delimited plaintext, but that obviously doesn't work for, say, XML.
No, block boundaries can't change during runtime, as your data should already be on disk and ready to read.
The InputSplit, however, depends on the InputFormat for the given job, which should remain consistent throughout a particular job; the Configuration object in the code, on the other hand, is basically a hash map, so it can certainly be changed while the job is running.
If you want to change the block size only for a particular run or application, you can do so by overriding "-D dfs.block.size=134217728". This lets you change the block size for your application instead of changing the overall block size in hdfs-site.xml.
-D dfs.block.size=134217728
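For example, the override can be passed to a single job (assuming the driver uses ToolRunner so that generic options are parsed) or to a single upload; the jar, class, and paths below are placeholders, and on Hadoop 2.x the property is named dfs.blocksize:
hadoop jar my-job.jar com.example.MyDriver -D dfs.block.size=134217728 /input /output
hadoop fs -D dfs.block.size=134217728 -put bigfile.dat /data/bigfile.dat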

Can I get around the no-update restriction in HDFS?

Thanks for the answers. I'm still not quite getting the answer I want. It's a specific question involving HDFS and the concat API.
Here it is. When concat talks about files, does it mean only "files created and managed by HDFS?" Or will it work on files that are not known to HDFS but just happen to live on the datanodes?
The idea is to:
1. Create a file and save it through HDFS. It's broken up into blocks and saved to the datanodes.
2. Go directly to the datanodes and make local copies of the blocks using normal shell commands.
3. Alter those copies. I now have a set of blocks that Hadoop doesn't know about. The checksums are definitely bad.
4. Use concat to stitch the copies together and "register" them with HDFS.
At the end of all that, I have two files as far as HDFS is concerned. The original and an updated copy. Essentially, I put the data blocks on the datanodes without going through Hadoop. The concat code put all those new blocks into a new HDFS file without having to pass the data through Hadoop.
I don't think this will work, but I need to be sure it won't. It was suggested to me as a possible solution to the update problem. I need to convince them this will not work.
The base philosophy of HDFS is:
write-once, read-many
Therefore, it is not possible to update files with the base implementation of HDFS. You can only append to the end of an existing file, and only if you are using a Hadoop branch that allows it. (The original version doesn't allow it.)
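For example, on versions where append is enabled (older releases required dfs.support.append=true, and newer ones ship an appendToFile shell command), the only in-place change you can make is adding bytes at the end; the paths here are placeholders:
hdfs dfs -appendToFile local-delta.txt /data/existing-file.txt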
An alternative could be to use a non-standard alternative to HDFS, such as the MapR file system: https://www.mapr.com/blog/get-real-hadoop-read-write-file-system#.VfHYK2wViko
Go for HBase, which is built on top of Hadoop, to get CRUD operations in the big data Hadoop world.
If you are not supposed to use a NoSQL database, then there is no way to update HDFS files in place. The only option is to rewrite them.

Storing mapreduce intermediate output on a remote server

I use a Hadoop (version 1.2.0) cluster of 16 nodes, one with a public IP (the master) and 15 connected through a private network (the slaves).
Is it possible to use a remote server (in addition to these 16 nodes) for storing the output of the mappers? The problem is that the nodes are running out of disk space during the map phase and I cannot compress map output any more.
I know that mapred.local.dir in mapred-site.xml is used to set a comma-separated list of directories where the tmp files are stored. Ideally, I would like to have one local dir (the default one) and one directory on the remote server. When the local disk fills up, I would then like to use the remote disk.
I am not very sure about this, but as per the link (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) it says:
The local directory is a directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.
Also there are some other properties which you should check out. These might be of help:
mapreduce.tasktracker.local.dir.minspacestart: If the space in mapreduce.cluster.local.dir drops under this, do not ask for more tasks. Value in bytes
mapreduce.tasktracker.local.dir.minspacekill: If the space in mapreduce.cluster.local.dir drops under this, do not ask more tasks until all the current ones have finished and cleaned up. Also, to save the rest of the tasks we have running, kill one of them, to clean up some space. Start with the reduce tasks, then go with the ones that have finished the least. Value in bytes.
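Both would go in mapred-site.xml; the byte values below are purely illustrative, and on Hadoop 1.x the older mapred.local.dir.minspacestart / mapred.local.dir.minspacekill names apply:
<property>
<name>mapreduce.tasktracker.local.dir.minspacestart</name>
<value>1073741824</value>
</property>
<property>
<name>mapreduce.tasktracker.local.dir.minspacekill</name>
<value>536870912</value>
</property>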
The solution was to use iSCSI. A technician helped us set it up, so unfortunately I am not able to provide more details on that.
We mounted the remote disk to a local path (/mnt/disk) on each slave node, and created a tmp directory there with rwx privileges for all users.
Then, we changed the $HADOOP_HOME/conf/mapred-site.xml file and added the property:
<property>
<name>mapred.local.dir</name>
<value>/mnt/disk/tmp</value>
</property>
Initially, we had two comma-separated values for that property, the first being the default value, but it still didn't work as expected (we still got some "No space left on device" errors). So we left only the one value there.
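For reference, a sketch of that initial two-directory form (the first path just stands in for whatever your default local directory is):
<property>
<name>mapred.local.dir</name>
<value>/tmp/hadoop/mapred/local,/mnt/disk/tmp</value>
</property>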

updating file in distributed cache in hadoop

How can we update a file (or files) in the distributed cache?
For instance, I have a properties file in the distributed cache. Now I have added a few more values to that properties file.
Options:
1. Append the new values to the old file and restart the job.
2. Replace the old file with the new one and restart the job.
3. Place the new file in a new location and point to that location.
Which of the above options are correct, and why?
This requires an understanding of how the distributed cache works:
When you add a file to the distributed cache, the file is copied to each task node at the time the job runs and is available locally there. Since this creates multiple copies, the file cannot be modified.
Options 2 and 3 sound feasible, but I'm not sure whether either is the right way.
If the file just has a bunch of properties, you can set these in the Configuration object instead of in a file in the distributed cache. You could use the collector to write the output to the desired location. (I do not know your use case in detail, so this may not be suitable.)
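As a hedged sketch of that approach: if the job driver is run through ToolRunner, individual properties can be passed straight into the job Configuration from the command line, so there is no cached file to update; the jar, class, property name, and paths below are hypothetical:
hadoop jar my-job.jar com.example.MyDriver -D my.props.threshold=42 /input /output
Inside a mapper or reducer the value can then be read with context.getConfiguration().get("my.props.threshold").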
