Understanding file handling in hadoop - hadoop

I am new to Hadoop ecosystem with some basic idea. Please assist on following queries to start with:
If the file size (file that am trying to copy into HDFS) is very big and unable to accommodate with the available commodity hardware in my Hadoop ecosystem system, what can be done? Will the file wait until it gets an empty space or the there is an error?
How to find well in advance or predict the above scenario will occur in a Hadoop production environment where we continue to receive files from outside sources?
How to add a new node to a live HDFS ecosystem? There are many methods but I wanted to know which files I need to alter?
How many blocks does a node have? If I assume that a node is a CPU with storage(HDD-500 MB), RAM(1GB) and a processor(Dual Core). In this scenario is it like 500GB/64? assuming that each block is configured to hold 64 GB RAM
If I copyFromLocal a 1TB file into HDFS, which portion of the file will be placed in which block in which node? How can I know this?
How can I find which record/row of the input file is available in which file of the multiple files split by Hadoop?
What are the purpose of each xmls configured? (core-site.xml,hdfs-site.xml & mapred-site.xml). In a distributed environment, which of these files should be placed in all the slave Data Nodes?
How to know how many map and reduce jobs will run for any read/write activity? Will the write operation always have 0 reducer?
Apologize for asking some of the basic questions. Kindly suggest methods to find answers for all of the above queries.

Related

Adding new files to a running hadoop cluster

consider that you have 10GB data and you want to process them by a MapReduce program using Hadoop. Instead of copying all the 10GB at the beginning to HDFS and then running the program, I want to for example copy 1GB and start the work and gradually add the remaining 9GB during the time. I wonder if it is possible in Hadoop.
Thanks,
Morteza
Unfortunately this is not possible with MapReduce. When you initiate a MapReduce Job, part of the setup process is determining block locations of your input. If the input is only partially there, the setup process will only work on those blocks and wont dynamically add inputs.
If you are looking for a stream processor, have a look at Apache Storm https://storm.apache.org/ or Apache Spark https://spark.apache.org/

Confusion about distributed cache in Hadoop

What does the distribute cache actually mean? Having a file in distributed cache means that is it available in every datanode and hence there will be no internode communication for that data, or does it mean that the file is in memory in every node?
If not, by what means can I have a file in memory for the entire job? Can this be done both for map-reduce, as well as for a UDF..
(In particular there is some configuration data, comparatively small that I would like to keep in memory as a UDF applies on hive query...? )
Thanks and regards,
Dhruv Kapur.
DistributedCache is a facility provided by the Map-Reduce framework to cache files needed by applications. Once you cache a file for your job, hadoop framework will make it available on each and every data nodes (in file system, not in memory) where you map/reduce tasks are running. Then you can access the cache file as local file in your Mapper Or Reducer job. Now you can easily read the cache file and populate some collection (e.g Array, Hashmap etc.) in your code.
Refer https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/filecache/DistributedCache.html
Let me know if still you have some questions.
You can read the cache file as local file in your UDF code. After reading the file using JAVA APIs just populate any collection (In memory).
Refere URL http://www.lichun.cc/blog/2013/06/use-a-lookup-hashmap-in-hive-script/
-Ashish

Does hadoop use folders and subfolders

I have started learning Hadoop and just completed setting up a single node as demonstrated in hadoop 1.2.1 documentation
Now I was wondering if
When files are stored in this type of FS should I use a hierachial mode of storage - like folders and sub-folders as I do in Windows or files are just written into as long as they have a unique name?
Is it possible to add new nodes to the single node setup if say somebody were to use it in production environment. Or simply can a single node be converted to a cluster without loss of data by simply adding more nodes and editing the configuration?
This one I can google but what the hell! I am asking anyway, sue me. What is the maximum number of files I can store in HDFS?
When files are stored in this type of FS should I use a hierachial mode of storage - like folders and sub-folders as I do in Windows or files are just written into as long as they have a unique name?
Yes, use the directories to your advantage. Generally, when you run jobs in Hadoop, if you pass along a path to a directory, it will process all files in that directory. So.. you really have to use them anyway.
Is it possible to add new nodes to the single node setup if say somebody were to use it in production environment. Or simply can a single node be converted to a cluster without loss of data by simply adding more nodes and editing the configuration?
You can add/remove nodes as you please (unless by single-node, you mean pseudo-distributed... that's different)
This one I can google but what the hell! I am asking anyway, sue me. What is the maximum number of files I can store in HDFS?
Lots
To expand on climbage's answer:
Maximum number of files is a function of the amount of memory available to your Name Node server. There is some loose guidance that each metadata entry in the Name Node requires somewhere between 150-200 bytes of memory (it alters by version).
From this you'll need to extrapolate out to the number of files, and the number of blocks you have for each file (which can vary depending on file and block size) and you can estimate for a given memory allocation (2G / 4G / 20G etc), how many metadata entries (and therefore files) you can store.

How to achieve desired block size with Hadoop with data on local filesystem

I have a 2TB sequence file that I am trying to process with Hadoop which resides on a cluster set up to use a local (lustre) filesystem for storage instead of HDFS. My problem is that no matter what I try, I am always forced to have about 66000 map tasks when I run a map/reduce jobs with this data as input. This seems to correspond with a block size of 2TB/66000 =~ 32MB. The actual computation in each map task executes very quickly, but the overhead associated with so many map tasks slows things down substantially.
For the job that created the data and for all subsequent jobs, I have dfs.block.size=536870912 and fs.local.block.size=536870912 (512MB). I also found suggestions that said to try this:
hadoop fs -D fs.local.block.size=536870912 -put local_name remote_location
to make a new copy with larger blocks, which I did to no avail. I have also changed the stripe size of the file on lustre. It seems that any parameters having to do with block size are ignored for local file system.
I know that using lustre instead of HDFS is a non-traditional use of hadoop, but this is what I have to work with. I'm wondering if others either have experience with this, or have any ideas to try other than what I have mentioned.
I am using cdh3u5 if that is useful.

Hadoop dfs replicate

Sorry guys,just a simple question but I cannot find exact question on google.
The question about what's dfs.replication mean? If I made one file named filmdata.txt in hdfs, if I set dfs.replication=1,so is it totally one file(one filmdata.txt)?or besides the main file(filmdata.txt) hadoop will create another replication file.
shortly say:if set dfs.replication=1,there are totally one filmdata.txt,or two filmdata.txt?
Thanks in Advance
The total number of files in the file system will be what's specified in the dfs.replication factor. So, if you set dfs.replication=1, then there will be only one copy of the file in the file system.
Check the Apache Documentation for the other configuration parameters.
To ensure high availability of data, Hadoop replicates the data.
When we are storing the files into HDFS, hadoop framework splits the file into set of blocks( 64 MB or 128 MB) and then these blocks will be replicated across the cluster nodes.The configuration dfs.replication is to specify how many replications are required.
The default value for dfs.replication is 3, But this is configurable depends on your cluster setup.
Hope this helps.
The link provided by Praveen is now broken.
Here is the updated link describing the parameter dfs.replication.
Refer Hadoop Cluster Setup. for more information on configuration parameters.
You may want to note that files can span multiple blocks and each block will be replicated number of times specified in dfs.replication (default value is 3). The size of such blocks is specified in the parameter dfs.block.size.
In HDFS framework, we use commodity machines to store the data, these commodity machines are not high end machines like servers with high RAM, there will be a chance of loosing the data-nodes(d1, d2, d3) or a block(b1,b2,b3), as a result HDFS framework splits the each block of data(64MB, 128MB) into three replications(as a default) and each block will be stored in a separate data-nodes(d1, d2, d3). Now consider block(b1) gets corrupted in data-node(d1) the copy of block(b1) is available in data-node(d2) and data-node(d3) as well so that client can request data-node(d2) to process the block(b1) data and provide the result and same as if data-node(d2) fails client can request data-node(d3) to process block(b1) data . This is called-dfs.replication mean.
Hope you got some clarity.

Resources