What is meant by 'local file system'? - hadoop

I'm currently reading about hadoop and I came across this which has puzzled me (please bear in mind that I am a complete novice when it comes to hadoop) -
Use the Hadoop get command to copy a file from HDFS to your local file
system:
$ hadoop hdfs dfs -get file_name /user/login_user_name
What is a local file system? I understand that a HDFS partitions a file into different blocks throughout the cluster (but I know there's more to it than that). My understanding of the above command is that I can copy a file from the cluster to my personal (i.e. local) computer? Or is that completely wrong? I'm just not entirely sure what is meant by a local file system.

Local FS means your Linux FS or Windows FS, i.e. the file system of the machine itself, which is not part of the DFS.
Your understanding is correct: with -get you are copying a file from HDFS to the local FS. Also note that you cannot use both hadoop and hdfs in the same command. The command should be one of the following:
hdfs dfs -get file_name local_path
hadoop fs -get file_name local_path
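For example (a sketch; report.txt is a hypothetical file name), to copy a file from your HDFS home directory into /tmp on the local machine:
$ hdfs dfs -get /user/login_user_name/report.txt /tmp/report.txt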

Just as you can divide a file system into different drives, you can set up the Hadoop file system as a separate file system on top of the Linux file system.
Your local file system is the file system on which you have installed Hadoop; your machine acts as "local" in this case when copying a file from your machine to Hadoop.
You might want to look at: HDFS vs LFS

THINK of a cluster node (server) as having to fulfill 2 needs:
the need to store its own operating system, application and user data-related files; and
the need to store its portion of sharded or "distributed" cluster data files.
In each cluster data node, then, there need to be 2 independent file systems:
the LOCAL ("non-distributed") file system:
stores the OS and all OS-related ancillary ("helper") files;
stores the binary files which form the applications which run on the server;
stores additional data files, but these exist as simple files which are NOT sharded/replicated/distributed in the server's "cluster data" disks;
typically comprised of numerous partitions - entire formatted portions of a single disk or multiple disks;
typically also running LVM in order to ensure "expandability" of these partitions containing critical OS-related code which cannot be permitted to saturate or the server will suffer catastrophic (unrecoverable) failure.
AND
the DISTRIBUTED file system:
stores only the sharded, replicated portions of what are actually massive data files "distributed" across all the other data drives of all the other data nodes in the cluster
typically comprised of at least 3 identical disks, all "raw" - unformatted, with NO RAID of any kind and NO LVM of any kind, because the cluster software (installed on the "local" file system) is actually responsible for its OWN replication and fault-tolerance, so that RAID and LVM would actually be REDUNDANT, and therefore cause unwanted LATENCY in the entire cluster performance.
LOCAL <==> OS and application and data and user-related files specific or "local" to the specific server's operation itself;
DISTRIBUTED <==> sharded/replicated data; capable of being concurrently processed by all the resources in all the servers in the cluster.
A file can START in a server's LOCAL file system where it is a single little "mortal" file - unsharded, unreplicated, undistributed; if you were to delete this one copy, the file is gone gone GONE...
... but if you first MOVE that file to the cluster's DISTRIBUTED file system, where it becomes sharded, replicated and distributed across at least 3 different drives likely on 3 different servers which all participate in the cluster, so that now if a copy of this file on one of these drives were to be DELETED, the cluster itself would still contain 2 MORE copies of the same file (or shard); and WHERE in the local system your little mortal file could only be processed by the one server and its resources (CPUs + RAM)...
... once that file is moved to the CLUSTER, now it's sharded into myriad smaller pieces across at least 3 different servers (and quite possibly many many more), and that file can have its little shards all processed concurrently by ALL the resources (CPUs & RAM) of ALL the servers participating in the cluster.
And that is the difference between the LOCAL file system and the DISTRIBUTED file system operating on each server, and that is the secret to the power of cluster computing :-) !...
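For example (a sketch with hypothetical paths), moving a file from a server's LOCAL file system into the cluster's DISTRIBUTED file system:
$ hadoop fs -moveFromLocal /home/me/bigdata.csv /data/bigdata.csv
$ hadoop fs -ls /data
-moveFromLocal deletes the local copy once the upload succeeds; from that point on the file exists only as sharded, replicated blocks managed by the cluster.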
Hope this offers a clearer picture of the difference between these two often-confusing concepts!
-Mark from North Aurora

Related

Understanding file handling in hadoop

I am new to the Hadoop ecosystem, with only some basic knowledge. Please assist with the following queries to start with:
If the file size (of a file that I am trying to copy into HDFS) is very big and cannot be accommodated by the available commodity hardware in my Hadoop cluster, what can be done? Will the file wait until empty space becomes available, or will there be an error?
How can I find out well in advance, or predict, that the above scenario will occur in a Hadoop production environment where we continue to receive files from outside sources?
How do I add a new node to a live HDFS ecosystem? There are many methods, but I wanted to know which files I need to alter.
How many blocks does a node have? Assume that a node is a machine with storage (HDD, 500 GB), RAM (1 GB) and a dual-core processor. In this scenario is it 500 GB / 64 MB, assuming that each block is configured to be 64 MB?
If I copyFromLocal a 1TB file into HDFS, which portion of the file will be placed in which block in which node? How can I know this?
How can I find which record/row of the input file is available in which file of the multiple files split by Hadoop?
What are the purpose of each xmls configured? (core-site.xml,hdfs-site.xml & mapred-site.xml). In a distributed environment, which of these files should be placed in all the slave Data Nodes?
How to know how many map and reduce jobs will run for any read/write activity? Will the write operation always have 0 reducer?
Apologies for asking some of these basic questions. Kindly suggest methods to find answers for all of the above queries.

How to make Hadoop Map Reduce process multiple files in a single run?

For a Hadoop Map Reduce program that we run by executing this command: $ hadoop jar my.jar DriverClass input1.txt hdfsDirectory. How do we make Map Reduce process multiple files (input1.txt & input2.txt) in a single run?
Like this:
hadoop jar my.jar DriverClass hdfsInputDir hdfsOutputDir
where
hdfsInputDir is the path on HDFS where your input files are stored (i.e., the parent directory of input1.txt and input2.txt)
hdfsOutputDir is the path on HDFS where the output will be stored (it should not exist before running this command).
Note that your input should be copied to HDFS before running this command.
To copy it to HDFS, you can run:
hadoop fs -copyFromLocal localPath hdfsInputDir
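Putting it together (a sketch; the jar, class and directory names are the ones from the question or placeholders, and it assumes DriverClass passes its first argument to FileInputFormat):
$ hadoop fs -mkdir hdfsInputDir
$ hadoop fs -copyFromLocal input1.txt hdfsInputDir
$ hadoop fs -copyFromLocal input2.txt hdfsInputDir
$ hadoop jar my.jar DriverClass hdfsInputDir hdfsOutputDir
The job will then pick up every file inside hdfsInputDir.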
This is your small files problem: for every file, a separate mapper will run.
A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files, then you probably have lots of them (otherwise you wouldn’t turn to Hadoop), and the problem is that HDFS can’t handle lots of files.
Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
Solutions
HAR files
Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode’s memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. To a client using the HAR filesystem nothing has changed: all of the original files are visible and accessible (albeit using a har:// URL). However, the number of files in HDFS has been reduced.
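For example, creating and inspecting an archive might look like this (a sketch with hypothetical paths):
$ hadoop archive -archiveName files.har -p /user/me input /user/me/archives
$ hadoop fs -ls har:///user/me/archives/files.har
This packs the contents of /user/me/input into /user/me/archives/files.har; the archived files remain visible and accessible through the har:// URL.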
Sequence Files
The usual response to questions about “the small files problem” is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs. Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record).

Does hadoop use folders and subfolders

I have started learning Hadoop and just completed setting up a single node as demonstrated in hadoop 1.2.1 documentation
Now I was wondering if
When files are stored in this type of FS, should I use a hierarchical mode of storage, like folders and sub-folders as I do in Windows, or are files just written in as long as they have a unique name?
Is it possible to add new nodes to the single node setup if say somebody were to use it in production environment. Or simply can a single node be converted to a cluster without loss of data by simply adding more nodes and editing the configuration?
This one I can google but what the hell! I am asking anyway, sue me. What is the maximum number of files I can store in HDFS?
When files are stored in this type of FS, should I use a hierarchical mode of storage, like folders and sub-folders as I do in Windows, or are files just written in as long as they have a unique name?
Yes, use the directories to your advantage. Generally, when you run jobs in Hadoop, if you pass along a path to a directory, it will process all files in that directory. So.. you really have to use them anyway.
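For example (hypothetical paths and jar/class names), you could organise input by date and point a job at a directory:
$ hadoop fs -mkdir /user/me/logs/2014-01
$ hadoop fs -put jan-01.log jan-02.log /user/me/logs/2014-01
$ hadoop jar myjob.jar MyDriver /user/me/logs/2014-01 /user/me/output/2014-01
All files under /user/me/logs/2014-01 are processed in that run.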
Is it possible to add new nodes to the single node setup if say somebody were to use it in production environment. Or simply can a single node be converted to a cluster without loss of data by simply adding more nodes and editing the configuration?
You can add/remove nodes as you please (unless by single-node, you mean pseudo-distributed... that's different)
This one I can google but what the hell! I am asking anyway, sue me. What is the maximum number of files I can store in HDFS?
Lots
To expand on climbage's answer:
The maximum number of files is a function of the amount of memory available to your Name Node server. There is some loose guidance that each metadata entry in the Name Node requires somewhere between 150 and 200 bytes of memory (it varies by version).
From this you'll need to extrapolate to the number of files, and the number of blocks you have for each file (which can vary depending on file and block size); then, for a given memory allocation (2G / 4G / 20G etc.), you can estimate how many metadata entries (and therefore files) you can store.
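As a rough worked example (using the ~150 bytes-per-entry figure above and assuming one block per file): a Name Node with a 4 GB heap can hold on the order of 4,000,000,000 / 150 ≈ 26 million metadata entries; at one file entry plus one block entry per file, that is roughly 13 million small files.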

How much overhead a cachedDistributed file has in a mapreduce program?

How much overhead does each cachedDistributed file have in a map-reduce program? I have a mapreduce program in which I need to have 50 cachedDistributed files (of very small size); it seems that the overhead they have is much larger than in the case where I have only 1 cachedDistributed file. Is that true?
As far as I understand, cachedDistributed files are copied to each machine that runs a mapper, thus access to a cachedDistributed file is local and shouldn't have too much overhead.
I think you might try to use archive files (files are unarchived on the task node automatically).
You can add archive files to the DistributedCache by two means:
With a tool that uses GenericOptionsParser. Then, you can specify the files to be distributed as a comma-separated list of URIs as the argument to the -archives option. If you don't specify the scheme, the files are assumed to be local. So, when you launch the job, the local file is copied to the distributed filesystem (often HDFS)
$> hadoop jar foo.jar ClassUsingDistributedCacheFile -archives archive.jar input output
With the distributed cache API (see the javadoc). With the API, the files specified by the URI must be in a shared filesystem (so the Java API does not copy the file).
Before a task is run, the tasktracker copies the files from the distributed filesystem to a local disk, as you say. I think the overhead comes from retrieving all your little files from HDFS.
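So, instead of adding the 50 small files to the cache individually, you might pack them into a single archive first (a sketch; the small file names and cachefiles.jar are hypothetical, foo.jar and ClassUsingDistributedCacheFile are from the example above):
$ jar cf cachefiles.jar small1.txt small2.txt small3.txt
$ hadoop jar foo.jar ClassUsingDistributedCacheFile -archives cachefiles.jar input output
The single archive is copied to each task node once and unpacked there, rather than 50 separate files being retrieved from HDFS.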

How does CopyFromLocal command for Hadoop DFS work?

I'm a little confused on how the Hadoop Distributed File System is set up and how my particular setup affects it. I used this guide to set it up http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ using two Virtual Machines on Virtual Box and have run the example (just a simple word count with txt file input). So far, I know that the datanode manages and retrieves the files on its node, while the tasktracker analyzes the data.
1) When you use the command -copyFromLocal, are you copying files/input to the HDFS? Does Hadoop know how to divide the information between the slaves/master, and how does it do it?
2) In the configuration outlined in the guide linked above, are there technically two slaves (the master acts as both the master and a slave)? Is this common or is the master machine usually only given jobtracker/namenode tasks?
There are a lot of questions asked here.
Question 2)
There are two machines
These machines are configured for HDFS and Map-Reduce.
HDFS configuration requires Namenode (master) and Datanodes (Slave)
Map-reduce requires Jobtracker (master) and Tasktracker (Slave)
Only one Namenode and one Jobtracker are configured, but you can have Datanode and Tasktracker services on both machines. It is not the machine which acts as master or slave; it is just the services. You can have slave services installed on machines which also contain master services. This is good for a simple development setup. In large scale deployments, you dedicate master services to separate machines.
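In a Hadoop 1.x setup like the one in the linked guide, which hosts run the slave services is controlled by the conf/slaves file on the master (the master/slave hostnames here follow that guide; treat them as placeholders):
# conf/slaves -- hosts on which DataNode and TaskTracker services are started
master
slave
A host listed in conf/slaves runs the slave services even if it also runs the NameNode and JobTracker.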
Question 1 Part 2)
It is HDFS's job to split the file into chunks (blocks) and store them on multiple data nodes in a replicated manner. You don't have to worry about it.
Question 1 Part 1)
Hadoop file operations are patterned after typical Unix file operations: ls, put, etc.
hadoop fs -put localfile /data/somefile --> will copy a local file to HDFS at the path /data/somefile
With the put option you can also read from standard input and write to an HDFS file (see the sketch below)
copyFromLocal is similar to the put option, except that its behavior is restricted to copying from the local file system to HDFS
See: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html#copyFromLocal
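A sketch of the stdin form (the destination path is hypothetical):
$ echo "some text" | hadoop fs -put - /data/from_stdin.txt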
1)
The client connects to the name node to register a new file in HDFS.
The name node creates some metadata about the file (either using the default block size, or a configured value for the file)
For each block of data to be written, the client queries the name node for a block ID and list of destination datanodes to write the data to. Data is then written to each of the datanodes.
There is some more information in the Javadoc for org.apache.hadoop.hdfs.DFSClient.DFSOutputStream
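If you want to see where the blocks of a file actually ended up (which block IDs live on which datanodes), you can run fsck against it (a sketch, reusing the path from the put example above):
$ hadoop fsck /data/somefile -files -blocks -locations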
2) Some production systems will be configured to make the master its own dedicated node (allowing the maximum possible memory allocation, and to avoid CPU contention), but if you have a smaller cluster, then a node which contains a name node and data node is acceptable
