Some info needed for Hadoop NameNode

I am trying to understand Hadoop and I am referring to the book "Hadoop: The Definitive Guide".
I have some doubts about the data which the NameNode manages; please refer to the image below:
Based on this, I have the following questions:
Q1) What is the meaning of the filesystem namespace?
Q2) What is the meaning of the filesystem tree?
Q3) What is metadata? Are metadata and namespace two different things?
Q4) What is the namespace image?
Q5) What are edit logs?
Can anyone please help me understand this?
There are many terms involved and no clear definitions are provided.

Filesystem tree: /, /home, /tmp, etc. - the directory hierarchy itself, i.e. the filesystem. HDFS is an abstraction layer over the physical disks it runs on.
Metadata: "file xyz is located at /tmp, is 5KB large and is read-only" - the data stored to describe any file: location, size, permissions, etc.
The namespace is the combination of these items.
The namespace image is a persistent snapshot of that namespace on disk, and an edit log is a transcript of actions performed against that image since the snapshot was taken. Together they make the NameNode fault tolerant and provide checkpoints at which data consistency is known, with far less overhead than comparing raw files across a distributed system.
That covers the namespace image and edit log questions as well.
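To make the vocabulary concrete, here is a toy model in Scala. These are not HDFS's real classes (Metadata, Node, Namespace and EditLogEntry are names made up purely for illustration); the point is only to show how the tree, the per-file metadata, the in-memory namespace, the on-disk image and the edit log relate to each other:

case class Metadata(sizeBytes: Long, permissions: String, replication: Int)

sealed trait Node
case class File(name: String, meta: Metadata) extends Node
case class Dir(name: String, children: List[Node]) extends Node // the filesystem tree

case class Namespace(root: Dir)                    // held in NameNode RAM
case class EditLogEntry(op: String, path: String)  // e.g. EditLogEntry("MKDIR", "/home")

object NamespaceSketch {
  def main(args: Array[String]): Unit = {
    // the "image": a snapshot of the whole tree plus per-file metadata
    val image = Namespace(Dir("/", List(
      Dir("tmp", List(File("xyz", Metadata(5 * 1024, "r--r--r--", 3)))))))
    // the "edit log": changes applied since that snapshot was written
    val edits = List(EditLogEntry("MKDIR", "/home"), EditLogEntry("DELETE", "/tmp/xyz"))
    println(image)
    println(edits)
  }
}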

Related

Sync files on HDFS that have the same size but different contents

I am trying to sync files from one Hadoop cluster to another using DistCp and Airbnb's ReAir utility, but neither of them works as expected.
If the file size is the same on source and destination, both of them fail to update it even if the file contents have changed (the checksums also differ), unless the overwrite option is used.
I need to keep around 30TB of data in sync, so loading the complete dataset every time is not feasible.
Could anyone please suggest how I can bring two datasets into sync when the file size is the same (the count in the source has changed) but the checksums differ?
The way DistCp handles syncing between files that are the same size but have different contents is by comparing their so-called FileChecksum. The FileChecksum was first introduced in HADOOP-3981, mostly for the purpose of being used in DistCp. Unfortunately, this has the known shortcoming of being incompatible between different storage implementations, and even between HDFS instances that have different internal block/chunk settings. Specifically, the FileChecksum bakes in the structure of having, for example, 512 bytes per chunk and 128MB per block.
Since GCS doesn't have the same notions of "chunks" or "blocks", there's no way for it to have any similar definition of a FileChecksum. The same is also true of all other object stores commonly used with Hadoop; the DistCp documentation appendix discusses this fact under "DistCp and Object Stores".
That said, there's a neat trick that can be done to define a nice standardized representation of a composite CRC for HDFS files that is mostly in-place compatible with existing HDFS deployments; I've filed HDFS-13056 with a proof of concept to try to get this added upstream, after which it should be possible to make it work out-of-the-box against GCS, since GCS also supports file-level CRC32C.
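If you want to see what DistCp is actually comparing, you can fetch the FileChecksum yourself through the FileSystem API. A minimal sketch - the cluster URIs and the path are made up, and getFileChecksum can return null on stores that don't support it:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ChecksumCompare {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val src = "hdfs://clusterA/user/data/part-00000" // hypothetical source file
    val dst = "hdfs://clusterB/user/data/part-00000" // hypothetical destination file
    val srcCk = FileSystem.get(new URI(src), conf).getFileChecksum(new Path(src))
    val dstCk = FileSystem.get(new URI(dst), conf).getFileChecksum(new Path(dst))
    // only comparable when both clusters use the same block/chunk settings
    println(s"source:      $srcCk")
    println(s"destination: $dstCk")
    println(s"match: ${srcCk != null && srcCk == dstCk}")
  }
}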

Understanding file handling in hadoop

I am new to the Hadoop ecosystem and have only a basic idea of it. Please assist with the following queries to start with:
If the file I am trying to copy into HDFS is very big and cannot be accommodated by the commodity hardware available in my Hadoop cluster, what can be done? Will the file wait until space frees up, or will there be an error?
How can I predict well in advance that the above scenario will occur in a Hadoop production environment where we continually receive files from outside sources?
How do I add a new node to a live HDFS cluster? There are many methods, but which files do I need to alter?
How many blocks does a node have? Assume a node is a machine with storage (a 500 GB HDD), 1 GB of RAM and a dual-core processor. In this scenario is it something like 500 GB / 64 MB, assuming each block is configured to be 64 MB? (A worked example of this arithmetic follows below.)
If I copyFromLocal a 1 TB file into HDFS, which portion of the file will be placed in which block on which node? How can I know this?
How can I find which record/row of the input file ends up in which of the multiple files split by Hadoop?
What is the purpose of each of the configured XML files (core-site.xml, hdfs-site.xml and mapred-site.xml)? In a distributed environment, which of these files should be placed on all the slave DataNodes?
How can I know how many map and reduce jobs will run for any read/write activity? Will the write operation always have 0 reducers?
Apologies for asking some basic questions. Kindly suggest ways to find answers to all of the above queries.
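For the block-count question above, here is the arithmetic spelled out, using the question's own assumptions (a 500 GB disk fully available to HDFS and the default 64 MB block size, ignoring replication):

object BlockArithmetic {
  def main(args: Array[String]): Unit = {
    val diskMb  = 500L * 1024 // 500 GB expressed in MB
    val blockMb = 64L         // default dfs.block.size in Hadoop 1.x is 64 MB
    println(s"roughly ${diskMb / blockMb} blocks") // 512000 / 64 = 8000
  }
}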

Hadoop Spark (MapR) - addFile: how does it work?

I am trying to understand how Hadoop works. Say I have 10 directories on HDFS, containing hundreds of files which I want to process with Spark.
In the book Fast Data Processing with Spark it says: "This requires the file to be available on all the nodes in the cluster, which isn't much of a problem for a local mode. When in a distributed mode, you will want to use Spark's addFile functionality to copy the file to all the machines in your cluster."
I am not able to understand this; will Spark create a copy of the file on each node?
What I want is for it to read the files that are present in that directory (if that directory is present on that node).
Sorry, I am a bit confused about how to handle the above scenario in Spark.
Regards
The section you're referring to introduces SparkContext::addFile in a confusing context. This is a section titled "Loading data into an RDD", but it immediately diverges from that goal and introduces SparkContext::addFile more generally as a way to get data into Spark. Over the next few pages it introduces some actual ways to get data "into an RDD", such as SparkContext::parallelize and SparkContext::textFile. These resolve your concerns about splitting up the data among nodes rather than copying the whole of the data to all nodes.
A real production use-case for SparkContext::addFile is to make a configuration file available to some library that can only be configured from a file on the disk. For example, when using MaxMind's GeoIP Legacy API, you might configure the lookup object for use in a distributed map like this (as a field on some class):
@transient lazy val geoIp = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)
Outside your map function, you'd need to make GeoIP.dat available like this:
sc.addFile("/path/to/GeoIP.dat")
Spark will then make it available in the current working directory on all of the nodes.
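If the library you are configuring will not accept a relative path, you can resolve the absolute location of the shipped file on each executor with SparkFiles.get. A rough sketch of the whole pattern - the input path and the enrichment step are placeholders:

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object AddFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("addFile-sketch"))
    sc.addFile("/path/to/GeoIP.dat") // driver side: ship the file to every node

    val enriched = sc.textFile("hdfs:///logs/") // hypothetical input directory
      .mapPartitions { lines =>
        val localPath = SparkFiles.get("GeoIP.dat") // executor side: absolute path of the shipped copy
        // ... open localPath once per partition and enrich each line ...
        lines
      }
    enriched.take(5).foreach(println)
    sc.stop()
  }
}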
So, in contrast with Daniel Darabos' answer, there are some reasons outside of experimentation to use SparkContext::addFile. Also, I can't find any info in the documentation that would lead one to believe that the function is not production-ready. However, I would agree that it's not what you want to use for loading the data you are trying to process unless it's for experimentation in the interactive Spark REPL, since it doesn't create an RDD.
addFile is only for experimentation. It is not meant for production use. In production you just open a file specified by a URI understood by Hadoop. For example:
sc.textFile("s3n://bucket/file")
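For the scenario in the question (ten HDFS directories containing hundreds of files), note that textFile also accepts comma-separated paths and glob patterns, so the data is read where it lives and split among the nodes as an RDD rather than copied everywhere. A small sketch with made-up paths:

import org.apache.spark.{SparkConf, SparkContext}

object ReadManyDirs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-many-dirs"))
    val some = sc.textFile("hdfs:///data/dir1,hdfs:///data/dir2") // explicit list of directories
    val all  = sc.textFile("hdfs:///data/dir*")                   // glob over all matching directories
    println(s"lines in two dirs: ${some.count()}, lines in all dirs: ${all.count()}")
    sc.stop()
  }
}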

Does Hadoop use folders and subfolders?

I have started learning Hadoop and have just completed setting up a single node as demonstrated in the Hadoop 1.2.1 documentation.
Now I was wondering:
When files are stored in this type of FS, should I use a hierarchical mode of storage - like folders and sub-folders as I do in Windows - or are files just written in as long as they have a unique name?
Is it possible to add new nodes to the single-node setup if, say, somebody were to use it in a production environment? Put simply, can a single node be converted to a cluster without loss of data by simply adding more nodes and editing the configuration?
This one I can google but what the hell! I am asking anyway, sue me. What is the maximum number of files I can store in HDFS?
When files are stored in this type of FS, should I use a hierarchical mode of storage - like folders and sub-folders as I do in Windows - or are files just written in as long as they have a unique name?
Yes, use the directories to your advantage. Generally, when you run jobs in Hadoop, if you pass along a path to a directory, it will process all files in that directory. So... you really have to use them anyway.
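As an illustration of "pass a directory and the job processes everything in it", here is a bare-bones job setup; the paths are hypothetical and the mapper/reducer classes are omitted:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object DirInputSketch {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "dir-input-sketch")
    // pointing the input format at a directory makes the job consume every file under it
    FileInputFormat.addInputPath(job, new Path("/user/hadoop/logs/2014-01-01"))
    FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/out"))
    // ... job.setMapperClass / job.setReducerClass / job.waitForCompletion(true) ...
  }
}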
Is it possible to add new nodes to the single node setup if say somebody were to use it in production environment. Or simply can a single node be converted to a cluster without loss of data by simply adding more nodes and editing the configuration?
You can add/remove nodes as you please (unless by single-node, you mean pseudo-distributed... that's different)
This one I can google but what the hell! I am asking anyway, sue me. What is the maximum number of files I can store in HDFS?
Lots
To expand on climbage's answer:
Maximum number of files is a function of the amount of memory available to your NameNode server. There is some loose guidance that each metadata entry in the NameNode requires somewhere between 150 and 200 bytes of memory (it varies by version).
From this you'll need to extrapolate out to the number of files, and the number of blocks you have for each file (which can vary depending on file and block size), and then estimate, for a given memory allocation (2G / 4G / 20G etc.), how many metadata entries (and therefore files) you can store.
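A back-of-the-envelope sketch of that extrapolation; the heap size, the bytes-per-entry figure and the blocks-per-file ratio are all illustrative assumptions, not published limits:

object NameNodeHeapEstimate {
  def main(args: Array[String]): Unit = {
    val heapBytes      = 4L * 1024 * 1024 * 1024 // assume a 4 GB NameNode heap
    val bytesPerEntry  = 150L                    // low end of the 150-200 byte guidance
    val blocksPerFile  = 2L                      // assume ~2 blocks per file on average
    val entriesPerFile = 1 + blocksPerFile       // one inode entry plus its block entries
    val maxFiles       = heapBytes / (bytesPerEntry * entriesPerFile)
    println(f"roughly $maxFiles%,d files for this configuration") // ~9.5 million
  }
}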

Request clarification on some HDFS concepts

I am not sure if this question belongs here; if not, then I apologize. I am reading the HDFS paper and am finding it difficult to understand a few of the terms. Please find my questions below.
1) As per the paper, "The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times, namespace and disk space quotas."
What exactly does namespace information mean in an inode? Does it mean the complete path of the file? I ask because the previous statement says "The HDFS namespace is a hierarchy of files and directories".
2) As per the paper, "The NameNode maintains the namespace tree and the mapping of file blocks to DataNodes (the physical location of file data)." Are the namespace tree and the namespace the same thing? (Please refer to point 1 for the definition of the namespace.) How is the namespace tree information stored? Is it stored as part of the inodes, where each inode also has a parent inode pointer?
3) As per the paper, "HDFS keeps the entire namespace in RAM. The inode data and the list of blocks belonging to each file comprise the metadata of the name system called the image." Does the image also contain the namespace?
4) What is the use of a namespace id? Is it used to distinguish between two different file system instances?
Thanks,
Venkat
What exactly does namespace information mean in an inode? Does it mean the complete path of the file? Because the previous statement says "The HDFS namespace is a hierarchy of files and directories".
It means that you can browse your files like you do on your local system (via commands like hadoop dfs -ls). You will see results like /user/hadoop/myFile.txt, but physically this file is distributed across your cluster in several blocks, each replicated according to your replication factor.
Are the namespace tree and the namespace the same thing? Please refer to point 1 for the definition of the namespace. How is the namespace tree information stored? Is it stored as part of the inodes, where each inode also has a parent inode pointer?
When you copy a file onto HDFS with commands like hadoop dfs -copyFromLocal myfile.txt /user/hadoop/myfile.txt, the file is split according to the dfs.block.size value (the default is 64MB). The blocks are then distributed across your DataNodes (the nodes used for storage). The NameNode keeps a map of all the blocks in order to verify your data integrity when it starts (or with commands like hadoop fsck /).
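You can ask for that block map yourself through the FileSystem API; a small sketch, reusing the path from above, that prints the byte range of each block and the DataNodes holding its replicas:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object WhereAreMyBlocks {
  def main(args: Array[String]): Unit = {
    val fs     = FileSystem.get(new Configuration())
    val path   = new Path("/user/hadoop/myfile.txt")
    val status = fs.getFileStatus(path)
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
    blocks.foreach { b =>
      println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
    }
  }
}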
Does the image also contain the namespace?
For this one I am not sure, but I think the namespace is kept in RAM too.
What is the use of a namespace id? Is it used to distinguish between two different file system instances?
Yes, the namespace ID is just an ID; it ensures coherence between the DataNodes' data and the filesystem instance they belong to.
I hope that helps, even if it is far from an exhaustive explanation.
