I have a few questions regarding Hadoop architecture.
In MapReduce, can we dynamically modify the block size and the number of mappers? If so, how do we do it?
How does a block get created in HDFS? For example, the Hadoop framework is installed on, say, a Red Hat Linux machine. The default block size of the Linux filesystem is 4 KB. Is the HDFS block a logical wrapper over these 4 KB blocks, or how does a block get created? Also, is it parallel or sequential? And if a file is only 32 MB while the block size is 64 MB, is the remaining 32 MB reusable?
I want to see the location (data node) of all the blocks of a particular file I just copied to HDFS. Is there any command to do that from a single location?
If I move a video file to HDFS, how does the block allocation happen for this video file?
In MapReduce, can we dynamically modify the block size and the number of mappers?
I assume that you are asking about the HDFS file system.
HDFS is a distributed storage system, and MapReduce is a distributed processing framework.
The HDFS block size can be changed in hdfs-site.xml.
Have a look at the documentation page for the various HDFS configuration properties.
dfs.blocksize
134217728 (default value)
The default block size for new files, in bytes. You can use the following suffixes (case insensitive): k (kilo), m (mega), g (giga), t (tera), p (peta), e (exa) to specify the size (such as 128k, 512m, 1g, etc.), or provide the complete size in bytes (such as 134217728 for 128 MB).
Related SE question:
How to set data block size in Hadoop ? Is it advantage to change it?
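The two parts of the question come together in the Java API. A minimal sketch, assuming made-up paths, sizes and job name: the FileSystem.create() overload takes an explicit per-file block size (overriding dfs.blocksize for that file only), and the number of mappers is not set directly but follows from the input splits, so the min/max split size settings below influence how many map tasks a job gets.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class BlockSizeAndSplits {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Per-file block size: this create() overload takes an explicit
            // block size, overriding dfs.blocksize for this one file.
            long blockSize = 64L * 1024 * 1024;                       // 64 MB
            FSDataOutputStream out = fs.create(
                    new Path("/tmp/example.dat"),                     // placeholder path
                    true,                                             // overwrite
                    conf.getInt("io.file.buffer.size", 4096),         // buffer size
                    (short) 3,                                        // replication
                    blockSize);
            out.close();

            // The number of mappers is not set directly: one map task runs per
            // input split, so a smaller max split size means more mappers and a
            // larger min split size means fewer.
            Job job = Job.getInstance(conf, "split-size-demo");
            FileInputFormat.addInputPath(job, new Path("/tmp/input")); // placeholder input
            FileInputFormat.setMinInputSplitSize(job, 16L * 1024 * 1024);
            FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
            // ... configure mapper, reducer and output, then submit as usual.
        }
    }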
How does a block get created in HDFS? For example, the Hadoop framework is installed on, say, a Red Hat Linux machine. The default block size of the Linux filesystem is 4 KB. Is the HDFS block a logical wrapper over these 4 KB blocks, or how does a block get created? Also, is it parallel or sequential? And if a file is only 32 MB while the block size is 64 MB, is the remaining 32 MB reusable?
The remaining 32 MB is reusable.
Have a look at this SE question for the HDFS block write operation:
Hadoop file write
I want to see the location (data node) of all the blocks of a particular file I just copied to HDFS. Is there any command to do that from a single location?
hadoop fsck /path/to/file -files -blocks -locations
(the -locations flag also prints the datanodes holding each block; without it, fsck only lists the blocks themselves)
Related SE question:
Viewing the number of blocks for a file in hadoop
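If you prefer the Java API to fsck, FileSystem#getFileBlockLocations returns the hosts holding each block of a file. A minimal sketch (the path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
        public static void main(String[] args) throws Exception {
            // Print the datanodes that hold each block of one HDFS file.
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/path/to/file"));
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }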
If I move a video file to HDFS, how does the block allocation happen for this video file?
Number of blocks = ceil(file size / DFS block size); the last block simply holds whatever is left over and can be smaller than the configured block size.
Once the number of blocks has been determined, those blocks are written as explained in the Hadoop file write question above.
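As a worked example of that formula (the 300 MB figure is made up), the division rounds up and the final block holds only the leftover bytes:

    public class BlockCount {
        public static void main(String[] args) {
            long fileSize  = 300L * 1024 * 1024;   // a 300 MB video (illustrative)
            long blockSize = 128L * 1024 * 1024;   // 128 MB DFS block size

            long fullBlocks = fileSize / blockSize;                // 2 full blocks
            long leftover   = fileSize % blockSize;                // 44 MB remaining
            long numBlocks  = fullBlocks + (leftover > 0 ? 1 : 0); // ceil -> 3 blocks

            System.out.println(numBlocks + " blocks, last block holds "
                    + leftover / (1024 * 1024) + " MB");
        }
    }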
A few more good questions:
Hadoop chunk size vs split vs block size
How hadoop decides how many nodes will do map and reduce tasks
Related
My file has a size of 10 MB. I stored it in Hadoop, but the default block size in HDFS is 64 MB, so my file uses 10 MB out of 64 MB. How will HDFS utilize the remaining 54 MB of free space in the same block?
Logically, if your file is smaller than the block size, HDFS will reduce the block size for that particular file to the size of the file. So HDFS only uses 10 MB to store a 10 MB file; it does not waste the remaining 54 MB or leave it blank.
Small files in HDFS are described in detail here: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
The remaining 54 MB can be used by some other file. Here is how it works: assume you do a put or copyFromLocal of two small files, each 20 MB in size, and your block size is 64 MB. HDFS calculates the available space in the filesystem (not the available blocks); if you previously saved a 10 MB file in a 64 MB block, the remaining 54 MB counts as free as well. It then reports usage in terms of blocks. Since you have two files with a replication factor of 3, a total of six blocks will be allocated for your files even though each file is smaller than the block size. If the cluster does not have room for six blocks (6 * 64 MB), the put will fail. Because free space is tracked in terms of bytes rather than blocks, you never run out of blocks; files are only measured in blocks at block allocation time.
Read this blog for more information.
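A small sketch of the same point through the HDFS Java API (both paths are placeholders): after copying a small local file into HDFS, FileStatus still reports the configured block size, while getLen() shows the bytes actually stored, which (times the replication factor) is what lands on disk.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SmallFileUsage {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Copy a small local file into HDFS (both paths are placeholders).
            Path local  = new Path("/tmp/ten-megabyte-file");
            Path remote = new Path("/user/demo/ten-megabyte-file");
            fs.copyFromLocalFile(local, remote);

            // getBlockSize() reports the configured block size (e.g. 64 MB),
            // while getLen() is the real amount of data in the file.
            FileStatus st = fs.getFileStatus(remote);
            System.out.println("block size = " + st.getBlockSize()
                    + " bytes, file length = " + st.getLen() + " bytes");
        }
    }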
I am new to Hadoop, and I know HDFS uses 64 MB (minimum) per block, which can be increased depending on the system. But since HDFS is installed on top of a Linux filesystem that uses 4 KB blocks, does Hadoop not suffer from disk seeks? Also, does HDFS interact with the Linux filesystem?
Your thinking is correct to a certain extent, but look at the bigger picture. When this 64 MB is stored on the Linux file system, it is distributed across many nodes. Consequently, if you want to read three blocks (4 KB each) stored on three different Linux file systems (machines), the reads happen in parallel, so you effectively pay for one seek rather than three.
I think this might help:
How are HDFS files getting stored on underlying OS filesystem?
I have taken the quote below from Hadoop: The Definitive Guide:
Note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.
Here are my questions:
1) If a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB, how does HDFS use the remaining 127 MB in this block?
2) Is there any chance of storing another file in the same block?
A 1 MB file stored with a 128 MB block size and a replication factor of 3 is stored as 3 block replicas and uses only 3 * 1 = 3 MB, not 3 * 128 = 384 MB. Each block is still reported with a block size of 128 MB, but that is just an abstraction for the metadata kept in the NameNode, not the actual storage used.
There is no way to store more than one file in a single block; each file is stored in its own block(s).
Reference:
https://stackoverflow.com/a/21274388/3496666
https://stackoverflow.com/a/15065274/3496666
https://stackoverflow.com/a/14109147/3496666
NameNode Memory Usage:
Every file, directory and block in HDFS is represented as an object (an item) in the NameNode's memory, and each object occupies roughly 150 to 200 bytes of NameNode memory. The NameNode therefore prefers fewer, larger files, because less metadata needs to be stored.
Consider a 1 GB file with a block size of 64 MB.
- Stored as a single 1 GB file:
Names: 1 item
Blocks: 1 GB / 64 MB = 16
Total items = 16 * 3 (replication factor 3) + 1 (file name) = 49
Total NameNode memory: 150 * 49 = 7,350 bytes
- Stored as 1000 individual 1 MB files:
Names: 1000 items
Blocks: 1000
Total items = 1000 * 3 (replication factor 3) + 1000 (file names) = 4000
Total NameNode memory: 150 * 4000 = 600,000 bytes
The results above make it clear that a large number of small files is an overhead for the NameNode, because it consumes more NameNode memory.
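The same back-of-the-envelope estimate as a small Java sketch, using the rough 150-bytes-per-object figure quoted above (actual NameNode memory usage varies by Hadoop version and object type):

    public class NameNodeMemoryEstimate {
        // Rough estimate: ~150 bytes of NameNode heap per file-name or block object.
        static long estimateBytes(long files, long blocksPerFile, int replication) {
            long items = files + files * blocksPerFile * replication;
            return items * 150;
        }

        public static void main(String[] args) {
            // One 1 GB file with 64 MB blocks: 16 blocks -> 49 items -> 7,350 bytes.
            System.out.println("one 1 GB file:     " + estimateBytes(1, 16, 3) + " bytes");
            // The same data as 1000 x 1 MB files: 4000 items -> 600,000 bytes.
            System.out.println("1000 x 1 MB files: " + estimateBytes(1000, 1, 3) + " bytes");
        }
    }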
The block name and block ID form a unique identifier for a particular block of data. This unique ID is used to identify the block when a client makes a request to read data, so a block cannot be shared between files.
HDFS is designed to handle large files. Let's say you have a 1000 MB file. With a 4 KB block size, you'd have to make 256,000 requests to read that file (one request per block). In HDFS, those requests go across a network and come with a lot of overhead. Each request has to be processed by the NameNode to figure out where the block can be found. That's a lot of traffic! If you use 64 MB blocks, the number of requests goes down to 16, greatly reducing the overhead and the load on the NameNode. With this in mind, Hadoop recommends a large block size.
The HDFS block size is a logical unit for splitting a large file into small chunks; each chunk is called a block. These blocks are then used for parallel processing of the data, i.e. MapReduce programs or other models that read or process the data within HDFS. If a file is small enough to fit in this logical block, then one block is assigned to the file, and it takes disk space according to the file size and the Unix file system you are using. The details of how a file gets stored on disk are available at this link:
HDFS block size Vs actual file size
As the HDFS block size is a logical unit rather than a physical unit of storage, no disk space is wasted.
These links will be useful for understanding the problem with small files:
Link1,
Link2
See Kumar's Answer
You could look into SequenceFiles or HAR files, depending on your use case. HAR files are analogous to the tar command; MapReduce can act on the files inside a HAR with a little overhead. SequenceFiles are, in a way, a container of key/value pairs, and the benefit is that a map task can act on each of these pairs (a small sketch follows after the links below).
HAR Files
Sequence Files
More About Sequence Files
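As a rough illustration of the SequenceFile approach (the directory and output paths are made up), many small local files can be packed into a single SequenceFile as (file name, file bytes) pairs, which map tasks can then iterate over:

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            File localDir = new File("/tmp/small-files");   // placeholder local directory

            // Write one SequenceFile containing (file name, file bytes) pairs.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("/user/demo/small-files.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (File f : localDir.listFiles(File::isFile)) {
                    byte[] bytes = Files.readAllBytes(f.toPath());
                    writer.append(new Text(f.getName()), new BytesWritable(bytes));
                }
            }
        }
    }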
In Hadoop: The Definitive Guide it says:
a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.
What does this mean? Does the 1 MB sit inside a 128 MB block, or is 1 MB used and the remaining 127 MB free for some other file to occupy?
This is often a misconception about HDFS - the block size is more about how a single file is split up / partitioned, not about some reserved part of the file system.
Behind the scenes, each block is stored on the DataNode's underlying file system as a plain file (with an associated checksum file). If you look into the data node folders on your disks, you should be able to find the file, provided you know the file's block ID and data node allocations, which you can discover from the NameNode Web UI.
So, back to your question: a 1 MB file with a block size of 16 MB / 32 MB / 128 MB / 512 MB / 1 GB / 2 GB (you get the idea) will still only be a 1 MB file on the data node's disk. The difference between the block size and the amount of data stored in that block is then free for the underlying file system to use as it sees fit (by HDFS, or by anything else).
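To see this for yourself, you can look for the plain blk_* files on a datanode's local disk. A small sketch in plain Java (the data directory path is a guess; check dfs.datanode.data.dir on your cluster):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class FindBlockFiles {
        public static void main(String[] args) throws IOException {
            // Datanode data directory (a guess; check dfs.datanode.data.dir).
            Path dataDir = Paths.get("/hadoop/dfs/data");
            try (Stream<Path> paths = Files.walk(dataDir)) {
                paths.filter(p -> {
                         String name = p.getFileName().toString();
                         // Block data files are named blk_<id>; skip .meta checksum files.
                         return name.startsWith("blk_") && !name.endsWith(".meta");
                     })
                     .forEach(p -> System.out.println(p + "  " + p.toFile().length() + " bytes"));
            }
        }
    }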
The Hadoop block size is a Hadoop storage concept. Every time you store a file in Hadoop, it is divided into blocks, and based on the replication factor and data locality it is distributed over the cluster.
For details, you can find my answer here:
Small files and HDFS blocks
This might seem like a silly question, but suppose in Hadoop the block size is X (typically 64 or 128 MB) and a local file size is Y, where Y is less than X. Now, when I copy file Y to HDFS, will it consume one block, or will Hadoop create smaller blocks?
One block is consumed by Hadoop. That does not mean that storage capacity will be consumed in an equivalent manner.
The output when browsing HDFS from the web UI looks like this:
Name       Type  Size       Replication  Block Size  Modification Time
filename1  file  48.11 KB   3            128 MB      2012-04-24 18:36
filename2  file  533.24 KB  3            128 MB      2012-04-24 18:36
filename3  file  303.65 KB  3            128 MB      2012-04-24 18:37
You can see that each file size is less than the block size, which is 128 MB; these file sizes are in the KB range.
HDFS capacity is consumed based on the actual file size but a block is consumed per file.
There is a limited number of blocks available, depending on the capacity of HDFS. You are wasting blocks, as you will run out of them before utilizing all of the actual storage capacity. Remember that the Unix filesystem also has the concept of a block size, but it is a very small number, around 512 bytes. This is inverted in HDFS, where the block size is kept much bigger, around 64-128 MB.
The other issue is that when you run MapReduce programs, the framework tries to spawn one mapper per block, so in this case, when you process three small files, it may end up spawning three mappers to work on them.
This wastes resources when the files are small. You also add latency, as each mapper takes time to spawn and then ultimately works on a very small file. You have to compact them into files closer to the block size to take advantage of mappers working on a smaller number of files (one way to reduce the mapper count at the input-format level is sketched below).
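A hedged sketch of that input-format approach: CombineTextInputFormat packs many small files into each input split, so a handful of mappers handles them instead of one mapper per file. It does not remove the NameNode memory overhead, since the small files still exist individually; the paths and the 128 MB cap below are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class CombineSmallFilesJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files");

            // Pack many small files into each split, up to ~128 MB per split,
            // so a few mappers handle them instead of one mapper per file.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

            FileInputFormat.addInputPath(job, new Path("/user/demo/many-small-files"));
            // ... set mapper, reducer and output path, then job.waitForCompletion(true).
        }
    }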
Yet another issue with numerous small files is the load on the NameNode, which keeps the mapping (metadata) of every file and block in main memory. With smaller files you fill up this table faster, and more main memory is required as the metadata grows.
Read the following for reference:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
http://www.ibm.com/developerworks/web/library/wa-introhdfs/
Oh, and there is a discussion on SO: Small files and HDFS blocks