Does changing the value of dfs.blocksize affect existing data - hadoop

My Hadoop version is 2.5.2. I am changing dfs.blocksize in the hdfs-site.xml file on the master node. I have the following questions:
1) Will this change affect the existing data in HDFS?
2) Do I need to propagate this change to all the nodes in the Hadoop cluster, or is changing it only on the NameNode sufficient?

1) Will this change affect the existing data in HDFS?
No, it will not. Existing files keep their old block size. For data to pick up the new block size, you need to rewrite it. You can do either a hadoop fs -cp or a distcp on your data. The new copy will have the new block size, and you can then delete the old data.
2) Do I need to propagate this change to all the nodes in the Hadoop cluster, or is changing it only on the NameNode sufficient?
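I believe in this case you only need to change the NameNode. (Strictly speaking, dfs.blocksize is read by the client that writes a file, so what matters is the configuration on the machines you write from.) However, changing it in only one place is a very bad idea. You need to keep all of your configuration files in sync for a number of good reasons. When you get more serious about your Hadoop deployment, you should probably start using something like Puppet or Chef to manage your configs.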
Also, note that whenever you change a configuration, you need to restart the NameNode and DataNodes in order for them to change their behavior.
Interesting note: you can set the block size of individual files as you write them to override the default block size. E.g., hadoop fs -D dfs.blocksize=134217728 -put a b
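To make the "rewrite the data" step concrete, here is a minimal sketch that only builds the copy command so you can review it before running it. The paths are hypothetical, and it assumes the Hadoop 2.x property name dfs.blocksize; it is an illustration, not the only way to do this.

```shell
# Build the command that rewrites a file with a new block size.
# dfs.blocksize is the Hadoop 2.x property name; the paths are hypothetical.
rewrite_cmd() {
    src=$1
    dst=$2
    block_bytes=$3
    echo "hadoop fs -D dfs.blocksize=${block_bytes} -cp ${src} ${dst}"
}

# 128 MiB expressed in bytes
NEW_BLOCK=$((128 * 1024 * 1024))

# Print the command first; run it only once it looks right.
rewrite_cmd /data/old.log /data/old.log.new "$NEW_BLOCK"
```

After the copy, hadoop fs -stat %o /data/old.log.new should report the block size of the new file, which lets you confirm the change before deleting the original.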

You should make the change in hdfs-site.xml on all slaves as well. dfs.blocksize should be consistent across all DataNodes.

Changing the block size in hdfs-site.xml will only affect new data.

Which distribution are you using? From your questions it looks like you are using the Apache distribution. The easiest way I can find is to write a shell script that first deletes hdfs-site.xml on the slaves, like
ssh username@domain.com 'rm /some/hadoop/conf/hdfs-site.xml'
ssh username@domain2.com 'rm /some/hadoop/conf/hdfs-site.xml'
ssh username@domain3.com 'rm /some/hadoop/conf/hdfs-site.xml'
and then copies hdfs-site.xml from the master to all the slaves:
scp /hadoop/conf/hdfs-site.xml username@domain.com:/hadoop/conf/
scp /hadoop/conf/hdfs-site.xml username@domain2.com:/hadoop/conf/
scp /hadoop/conf/hdfs-site.xml username@domain3.com:/hadoop/conf/
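The per-host commands above can be folded into a loop. This is a rough sketch: the host names are made up, and the config path is the one used in the answer. Since scp overwrites an existing file, the sketch skips the separate rm step.

```shell
# Hypothetical host list; the config path is taken from the answer above.
SLAVES="slave1.example.com slave2.example.com slave3.example.com"
CONF=/hadoop/conf/hdfs-site.xml

# Copy the master's hdfs-site.xml to every slave.
# With DRY_RUN=1 the scp commands are printed instead of executed.
push_conf() {
    for host in $SLAVES; do
        cmd="scp $CONF username@$host:/hadoop/conf/"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$cmd"
        else
            $cmd
        fi
    done
}

# Usage: DRY_RUN=1 push_conf   (review the output, then run push_conf)
```

Remember that the daemons still need a restart afterwards to pick up the new file.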

Related

Uploading file in HDFS cluster

I am learning Hadoop and so far I have configured a 3-node cluster:
127.0.0.1 localhost
10.0.1.1 hadoop-namenode
10.0.1.2 hadoop-datanode-2
10.0.1.3 hadoop-datanode-3
My Hadoop NameNode directory looks like this:
hadoop
bin
data-> ./namenode ./datanode
etc
logs
sbin
--
--
I have learned that when we upload a large file to the cluster, it divides the file into blocks. I want to upload a 1 GB file to my cluster and see how it is stored on the DataNodes.
Can anyone help me with the commands to upload a file and see where its blocks are stored?
First, check whether the Hadoop tools are in your PATH; if not, I recommend adding them.
One possible way of uploading a file to HDFS: hadoop fs -put /path/to/localfile /path/in/hdfs
I would suggest you read the documentation and get familiar with the high-level commands first, as it will save you time:
Hadoop Documentation
Start with the "dfs" command, as it is one of the most frequently used commands.
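As a quick sanity check on what to expect after the upload, the number of blocks is the file size divided by the block size, rounded up. The sketch below assumes a 128 MiB block size (the Hadoop 2.x default); your cluster's dfs.blocksize may differ.

```shell
# Number of HDFS blocks a file occupies: ceiling(file_size / block_size).
expected_blocks() {
    file_bytes=$1
    block_bytes=$2
    echo $(( (file_bytes + block_bytes - 1) / block_bytes ))
}

GIB=$((1024 * 1024 * 1024))
BLOCK=$((128 * 1024 * 1024))   # assumed block size: 128 MiB

expected_blocks "$GIB" "$BLOCK"    # prints 8
```

To see the actual blocks and which DataNodes hold them, hdfs fsck /path/in/hdfs -files -blocks -locations lists each block of the file together with its replica locations.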

Why is there a need for hadoop commands in pseudo-distributed mode?

It might be a stupid question, but I need to know.
For example: why do we need the hadoop fs -ls command to list files? Why can't plain ls be used instead?
My guess is that in pseudo-distributed mode, part of the filesystem is given to the Hadoop file system and is only accessible to the Hadoop NameNode daemon. Please explain.
ls will list all file spaces available to your computer.
You can set the fs.defaultFS property to file:/// (the default), and then both will act the same, but this is not considered pseudo-distributed mode.
Pseudo-distributed mode requires that you specify a list of DataNode and NameNode volumes on each respective system in the cluster, and hdfs dfs commands will only list those files that are known by the NameNode.
And it's called pseudo-distributed only because it's a single node. Once you have that working, adding another node should be straightforward, given appropriate network connections.

How to change java.io.tmpdir for spark job running on yarn

How can I change the java.io.tmpdir folder for my Hadoop 3 cluster running on YARN?
By default it is set to something like /tmp/***, but my /tmp filesystem is too small for everything the YARN job will write there.
Is there a way to change it?
I have also set hadoop.tmp.dir in core-site.xml, but it looks like it is not really used.
Perhaps it's a duplicate of "What should be hadoop.tmp.dir?". Also, go through all the configuration files in /etc/hadoop/conf and search for tmp to see if anything is hardcoded. Also specify:
Whether you see (any) files getting created at what you specified as hadoop.tmp.dir.
What pattern of files is being created at /tmp/** after your changes are applied.
I have also noticed Hive creating files in /tmp, so you may also have a look at hive-site.xml. The same goes for any other ecosystem product you are using.
I configured the yarn.nodemanager.local-dirs property in yarn-site.xml and restarted the cluster. After that, Spark stopped using the /tmp filesystem and used the directories configured in yarn.nodemanager.local-dirs.
The java.io.tmpdir property for Spark executors was also set to directories defined in the yarn.nodemanager.local-dirs property.
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/somepath1,/anotherpath2</value>
</property>

Restarting datanodes after reformatting namenode in a hadoop cluster

Using the basic configuration provided in the hadoop setup official documentation, I can run a hadoop cluster and submit mapreduce jobs.
The problem is whenever I stop all the daemons and reformat the namenode, when I subsequently start all the daemons, the datanode does not start.
I've been looking around for a solution and it appears that it is because the formatting only formats the namenode and the disk space for the datanode needs to be erased.
How can I do this? What changes do I need to make to my config files? After those changes are made, how do I delete the correct files when formatting the namenode again?
Specifically, check whether you have set the two parameters below, which can be defined in hdfs-site.xml (in Hadoop 2.x and later they are named dfs.namenode.name.dir and dfs.datanode.data.dir):
dfs.name.dir: Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
dfs.data.dir: Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
If you have provided specific directory locations for the two parameters above, then you need to delete those directories as well before formatting the NameNode.
If you have not set these two parameters, the directories are created by default under the location given by hadoop.tmp.dir, which can be configured in core-site.xml.
Again, if you have specified this parameter, then you need to remove that directory before formatting the NameNode.
If you have not defined it, the default location is /tmp/hadoop-$username, where $username is the user running Hadoop, so you need to remove this directory.
Summary: you have to delete the NameNode and DataNode directories before formatting the system. By default they are created under /tmp/.
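The fallback logic described above can be sketched as a small helper that prints which directories would need cleaning. It assumes the Hadoop 2.x defaults (name and data directories under hadoop.tmp.dir/dfs/name and hadoop.tmp.dir/dfs/data); the function name and its arguments are made up for illustration, and it only prints paths, it does not delete anything.

```shell
# Print the NameNode and DataNode storage directories, one per line.
# Pass "" for a property that is not set in your config files; the
# assumed Hadoop 2.x defaults under hadoop.tmp.dir are used instead.
storage_dirs() {
    name_dir=$1
    data_dir=$2
    tmp_dir=$3
    [ -n "$tmp_dir" ] || tmp_dir="/tmp/hadoop-${USER:-$(id -un)}"
    [ -n "$name_dir" ] || name_dir="$tmp_dir/dfs/name"
    [ -n "$data_dir" ] || data_dir="$tmp_dir/dfs/data"
    echo "$name_dir"
    echo "$data_dir"
}

# Nothing configured: both directories fall back to /tmp/hadoop-<user>/dfs/...
storage_dirs "" "" ""
```

Review the printed paths, stop all daemons, and only then remove the directories and format the NameNode. On a running installation, hdfs getconf -confKey dfs.namenode.name.dir should show the value that is actually in effect.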

Run a Local file system directory as input of a Mapper in cluster

I gave the mapper an input from the local filesystem. It runs successfully from Eclipse but not on the cluster, where it is unable to find the local input path and reports: input path does not exist. Can anybody help me give a local file path to a mapper so that it can run on the cluster and I can get the output in HDFS?
This is a very old question, but I recently faced the same issue.
I am not sure how correct this solution is, though it worked for me. Please point out any drawbacks. Here's what I did.
Reading a solution from the mail archives, I realised that if I change fs.default.name from hdfs://localhost:8020/ to file:///, it can access the local file system. However, I didn't want this for all my MapReduce jobs, so I made a copy of core-site.xml in a local system folder (the same one from which I submit my MR jar to hadoop jar).
In my Driver class for MR I added:
Configuration conf = new Configuration();
conf.addResource(new Path("/my/local/system/path/to/core-site.xml"));
conf.addResource(new Path("/usr/lib/hadoop-0.20-mapreduce/conf/hdfs-site.xml"));
The MR job takes input from the local system and writes the output to HDFS.
Running in a cluster requires the data to be loaded into distributed storage (HDFS). Copy the data to HDFS first using hadoop fs -copyFromLocal and then try to run your job again, giving it the path of the data in HDFS.
The question is an interesting one. One can have data on S3 and access this data without an explicit copy to HDFS prior to running the job. In the wordcount example, one would specify this as follows:
hadoop jar example.jar wordcount s3n://bucket/input s3n://bucket/output
What happens here is that the mappers read records directly from S3.
If this can be done with S3, why wouldn't Hadoop work similarly with this syntax instead of s3n?
file:///input file:///output
But empirically, this seems to fail in an interesting way -- I see that Hadoop gives a file-not-found exception for a file that is indeed in the input directory. That is, it seems to be able to list the files in the input directory on my local disk, but when it comes time to open them to read the records, the file is not found (or not accessible).
The data must be on HDFS for any MapReduce job to process it. So even if you have a source such as a local file system, a network path, or a web-based store (such as Azure Blob Storage or Amazon block storage), you would need to copy the data to HDFS first and then run the job.
The bottom line is that you need to push the data to HDFS first. There are several ways to do this depending on the data source; from a local file system, for example, you would use the following command:
hadoop fs -copyFromLocal SourceFileOrStoragePath _HDFS__Or_directPathatHDFS_
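Put together, the copy-then-run workflow looks roughly like the sketch below. It only prints the two commands so they can be reviewed first. example.jar and wordcount are borrowed from the S3 example earlier in this thread, and the function name and paths are hypothetical.

```shell
# Print the two steps: stage local input in HDFS, then run the job on it.
# example.jar / wordcount come from the thread; the paths are hypothetical.
stage_and_run() {
    local_in=$1
    hdfs_in=$2
    hdfs_out=$3
    echo "hadoop fs -copyFromLocal $local_in $hdfs_in"
    echo "hadoop jar example.jar wordcount $hdfs_in $hdfs_out"
}

stage_and_run /home/me/input /user/me/input /user/me/output
```

The job then reads from and writes to HDFS paths only, so every node in the cluster sees the same input.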
Try setting the input path like this:
FileInputFormat.addInputPath(conf, new Path("file:///the directory on your local filesystem"));
If you give the file:/// prefix, it can access files from the local system.
I tried the following code and it solved the problem.
Please try it and let me know.
You need to get a FileSystem object for the local file system and then use the makeQualified method to return the path. Since we need to pass a local filesystem path (there is no other way to pass this to the input format), I used makeQualified, which indeed returns only a local filesystem path.
The code is shown below.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Path inputPath = fs.makeQualified(new Path("/usr/local/srini/")); // local path
FileInputFormat.setInputPaths(job, inputPath);
I hope this works for your requirement, though it's posted very late. It worked fine for me. I believe it does not need any configuration changes.
You might want to try this by setting the configuration as
Configuration conf=new Configuration();
conf.set("mapred.job.tracker","local");
conf.set("fs.default.name","file:///");
After this you can set the FileInputFormat with the local path and you are good to go.

Resources