what is Hadoop block abstraction. Need more details on this - hadoop

I was going through Hadoop Definitive guide and was not cleared with below concepts.
Block abstraction, can someone elaborate more on this.
Making the unit of abstraction a block rather than a file simplifies the storage subsystem.
a.) what is unit of abstraction a block ?
b.) How to make unit of abstraction?
c.) How does it simplifies storage subsystem ?

HDFS Block abstraction:
HDFS block size is of 64MB-128MB(usually) and unlike other filesystems, a file smaller than the block size does not occupy the complete block size’s worth of memory.
The block size is kept so large so that less time is made doing disk seeks as compared to the data transfer rate.
Why block abstraction:
Files can be bigger than individual disks
Filesystem metadata does not need to be associated with each and every block.
Simplifies storage management - Easy to figure out the number of blocks which can be stored on each disk.
Fault tolerance and storage replication can be easily done on a per-block basis (storage/HA policies can be run on individual blocks).

HDFS in some ways is just another filesystem and it, like all the others, breaks files into blocks. Key differences here are that the blocks are big (ex: 128MB) instead of something small (ex: 4KB) and each block is replicated on different servers in the larger HDFS architecture.
Most of us don't work directly with blocks, we work with files and one could argue that this "block abstraction" is really for two purposes.
First, it let's the storage subsystem (HDFS) scale to massive level by continuing to add servers.
Second, it let's the frameworks (like MapReduce, TEZ, HBase, Spark, etc) align their tactical work to these blocks when processing the logical full file.

Related

HDFS behavior on lots of small files and 128 Mb block size

I have lots (up to hundreds of thousands) of small files, each 10-100 Kb. I have HDFS block size equal 128 MB. I have replication factor equal 1.
Is there any drawbacks of allocating HDFS block per small file?
I've seen pretty contradictory answers:
Answer which said the smallest file takes the whole block
Answer which said that HDFS is clever enough, and small file will take small_file_size + 300 bytes of metadata
I made a test like in this answer, and it proves that the 2nd option is correct - HDFS doesn't allocate the whole block for small files.
But, how about batch read of 10.000 small files from HDFS? Does it will be slow down because of 10.000 blocks and metadatas? Is there any reason to keep multiple small files within single block?
Update: my use case
I have only one use case for small files, from 1.000 up to 500.000. I calculate that files once, store it, and than read them all at once.
1) As I understand, NameNode space problem is not a problem for me. 500.000 is an absolute maximum, I will never have more. If each small file takes 150 bytes on NN, than the absolute maximum for me is - 71.52 MB, which is acceptable.
2) Does Apache Spark eliminate MapReduce problem? Will sequence files or HAR help me to solve the issue? As I understand, Spark shouldn't depend on Hadoop MR, but it's still too slow. 490 files takes 38 seconds to read, 3420 files - 266 seconds.
sparkSession
.read()
.parquet(pathsToSmallFilesCollection)
.as(Encoders.kryo(SmallFileWrapper.class))
.coalesce(numPartitions);
As you have noticed already, the HDFS file does not take anymore space than it needs, but there are other drawbacks of having the small files in the HDFS cluster. Let's go first through the problems without taking into consideration batching:
NameNode(NN) memory consumption. I am not aware about Hadoop 3 (which is being currently under development) but in previous versions NN is a single point of failure (you can add secondary NN, but it will not replace or enhance the primary NN at the end). NN is responsible for maintaining the file-system structure in memory and on the disk and has limited resources. Each entry in file-system object maintained by NN is believed to be 150 bytes (check this blog post). More files = more RAM consumed by the NN.
MapReduce paradigm (and as far as I know Spark suffers from the same symptoms). In Hadoop Mappers are being allocated per split (which by default corresponds to the block), this means, that for every small file you have out there a new Mapper will need to be started to process its contents. The problem is that for small files it actually takes much more for Hadoop to start the Mapper than process the file content. Basically, you system will be doing unnecessary work of starting/stopping Mappers instead of actually processing the data. This is the reason Hadoop processes much fast 1 128MBytes file (with 128MBytes blocks size) rather than 128 1MBytes files (with same block size).
Now, if we talk about batching, there are few options you have out there: HAR, Sequence File, Avro schemas, etc. It depends on the use case to give the precise answers to your questions. Let's assume you do not want to merge files, in this case you might be using HAR files (or any other solution featuring efficient archiving and indexing). In this case the NN problem is solved, but the number of Mappers still will be equal to the number of splits. In case merging files into large one is an option, you can use Sequence File, which basically aggregates small files into bigger ones, solving to some extend both problems. In both scenarios though you cannot really update/delete the information directly like you would be able to do with small files, thus more sophisticated mechanisms are required for managing those structures.
In general, in the main reason for maintaining many small files is an attempt to make fast reads, I would suggest to take a look to different systems like HBase, which were created for fast data access, rather than batch processing.

To hadoop or not to hadoop

We have data (not allot at this point) that we want to transform/aggregate/pivot up to wazoo.
I had a look on the www and all the answers i am asking is pointing to hadoop for scalable,cheap to run(no SQL server machine and license),fast(if you have allot of data), programmable(not little boxes that you drag around).
There is just one problem that i keep coming up against
namely 'Use hadoop if you have more than 10gb of data'
Now we don't even have 1gb of data(at this stage) is it still viable.
My other option is SSIS. Now we do use SSIS for some of our current ETL but we don't have resources for it and putting a SQL in the cloud is just going to cost to much and don't even get me started on scalability cost and config.
thanks
Your current data volume seems to be too low for making an entry into hadoop. Enter into hadoop ecosystem only if you are dealing with huge volume of data(TB/year) and if you suspect the data volume to increase exponentially down the line.
Let me explain why I suggest against hadoop for such low volume of data.
By default hadoop stores your files into 128MB chunks of data and while processing also, it takes 128MB Chunks at a time to process(parallely). If your business requirement involves heavy CPU intensive processing, then you can decrease the input chunk size from 128MB to less. But then again by decreasing the amount of data to be processed parallely, you'll end up increasing the number of IO seaks(low level block storage). At the end you might be spending more resource on managing the tasks rather than what the actual task is taking. Hence, try avoiding distributed computing as a solution for your(low) data volume.
As #Makubex has suggested, don't use hadoop.
And SISS is a good option as it handles the data in-memory so it would perform data aggregations, data type conversions, merging, etc at a much faster rate than writing to the disk using temporary tables in stored procedures.
Hadoop is meant for large amounts of data I would suggest it only for data in terabytes. It would be way slower that SISS(which runs in-memory) for small data-sets.
Refer: When to use T-SQL or SSIS for ETL

Intention of Hadoop FS is to keep in RAM or disk?

We are thinking about going with Hadoop in my company. From looking at the docs in the Internet I got the impression the idea of HDFS is to keep it in RAM to speed up things. Now our architects says that the main idea of HDFS is scalability. I'm fine with that. But then he also claims the main idea is to keep it on the on the harddisk. HDFS is basically a scalable harddisk. My opinion is that backing HDFS by the harddisk is an option. The main idea, however, is to keep it in RAM. Who is right now? I'm really confused now and the point is crucial for the understanding of Hadoop, I would say.
Thanks, Oliver
Oliver, your architect is correct. Horizontal scalability is one of the biggest advantages of HDFS(Hadoop in general). When you say Hadoop it implies that you are dealing with very huge amounts of data, right? How are you going to put so much data in-memory?(I am assuming that by the idea of HDFS is to keep it in RAM to speed up things you mean to keep the data stored in HDFS in RAM).
But, the HDFS's metadata is kept in-memory so that you can quickly access the data stored in your HDFS. Remember, HDFS is not something physical. It is rather a virtual filesystem that lies on top of your native filesystem. So, when you say you are storing data into HDFS, it eventually gets stored in your native/local filesystem on your machine's disk and not RAM.
Having said that, there are certain major differences in the way HDFS and native FS behave. Like the block size which is very large when compared to local FS block size. Similarly the replicated manner in which data is stored in HDFS(think of RAID but at the software level).
So how does HDFS make things faster?
Hadoop is a distributed platform and HDFS a distributed store. When you put a file into HDFS it gets split into n small blocks(of size 64MB default, but configurable). Then all the blocks
of a file get stored across all the machines of your Hadoop cluster. This allows us to read all of the block together in parallel thus reducing the total reading time.
I would suggest you to go through this link in order to get a proper understanding of HDFS :
http://hadoop.apache.org/docs/stable/hdfs_design.html
HTH

Serve static files from Hadoop

My job is to design a distributed system for static image/video files. The size of the data is about tens of Terabytes. It's mostly for HTTP access (thus no processing on data; or only simple processing such as resizing- however it's not important because it can be done directly in the application).
To be a little more clear, it's a system that:
Must be distributed (horizontal scale), because the total size of data is very big.
Primarily serves small static files (such as images, thumbnails, short videos) via HTTP.
Generally, no requirement on processing the data (thus MapReduce is not needed)
Setting HTTP access on the data could be done easily.
(Should have) good throughput.
I am considering:
Native network file system: But it seems not feasible because the data can not fit into one machine.
Hadoop filesystem. I worked with Hadoop mapreduce before, but I have no experience using Hadoop as a static file repository for HTTP requests. So I don't know if it's possible or if it's a recommended way.
MogileFS. It seems promising, but I feel that using MySQL to manage local files (on a single machine) will create too much overhead.
Any suggestion please?
I am the author of Weed-FS. For your requirement, WeedFS is ideal. Hadoop can not handle many small files, in addition to your reasons, each file needs to have an entry in the master. If the number of files are big, the hdfs master node can not scale.
Weed-FS is getting faster when compiled with latest Golang releases.
Many new improvements have been done on Weed-FS recently. Now you can test and compare very easily with the built-in upload tool. This one upload all files recursively under a directory.
weed upload -dir=/some/directory
Now you can compare by "du -k /some/directory" to see the disk usage, and "ls -l /your/weed/volume/directory" to see the Weed-FS disk usage.
And I suppose you would need replication with data center, rack aware, etc. They are in now!
Hadoop is optimized for large files e.g. It's default block size is 64M. A lot of small files are both wasteful and hard to manage on Hadoop.
You can take a look at other distributed file systems e.g. GlusterFS
Hadoop has a rest API for acessing files. See this entry in the documentation. I feel that Hadoop is not meant for storing large number of small files.
HDFS is not geared up to efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes. The block size is 64 mb. So even if the file is of 10kb, it would be allocated an entire block of 64 mb. Thats a waste disk space.
If the file is very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 files of 64MB blocks, and 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.
In "Hadoop Summit 2011", there was this talk by Karthik Ranganathan about Facebook Messaging in which he gave away this bit: Facebook stores data (profiles, messages etc) over HDFS but they dont use the same infra for images and videos. They have their own system named Haystack for images. Its not open source but they shared the abstract design level details about it.
This brings me to weed-fs: an open source project for inspired by Haystacks' design. Its tailor made for storing files. I have not used it till now but seems worth a shot.
If you are able to batch the files and have no requirement to update a batch after adding to HDFS, then you could compile multiple small files into a single larger binary sequence file. This is a more efficient way to store small files in HDFS (as Arnon points out above, HDFS is designed for large files and becomes very inefficient when working with small files).
This is the approach I took when using Hadoop to process CT images (details at Image Processing in Hadoop). Here the 225 slices of the CT scan (each an individual image) were compiled into a single, much larger, binary sequence file for long streaming reads into Hadoop for processing.
Hope this helps!
G

what does " local caching of data" mean in the context of this article?

From the following paragraphs of Text——
(http://developer.yahoo.com/hadoop/tutorial/module2.html),It mentions that sequential readable large files are not suitable for local caching. but I don't understand what does local here mean...
There are two assumptions in my opinion: one is Client caches data from HDFS and the other is datanode caches hdfs data in its local filesystem or Memory for Clients to access quickly. is there anyone who can explain more? Thanks a lot.
But while HDFS is very scalable, its high performance design also restricts it to a
particular class of applications; it is not as general-purpose as NFS. There are a large
number of additional decisions and trade-offs that were made with HDFS. In particular:
Applications that use HDFS are assumed to perform long sequential streaming reads from
files. HDFS is optimized to provide streaming read performance; this comes at the expense of
random seek times to arbitrary positions in files.
Data will be written to the HDFS once and then read several times; updates to files
after they have already been closed are not supported. (An extension to Hadoop will provide
support for appending new data to the ends of files; it is scheduled to be included in
Hadoop 0.19 but is not available yet.)
Due to the large size of files, and the sequential nature of reads, the system does
not provide a mechanism for local caching of data. The overhead of caching is great enough
that data should simply be re-read from HDFS source.
Individual machines are assumed to fail on a frequent basis, both permanently and
intermittently. The cluster must be able to withstand the complete failure of several
machines, possibly many happening at the same time (e.g., if a rack fails all together).
While performance may degrade proportional to the number of machines lost, the system as a
whole should not become overly slow, nor should information be lost. Data replication
strategies combat this problem.
Any real Mapreduce job is probably going to process GB's (10/100/1000s) of data from HDFS.
Therefore any one mapper instance is most probably going to be processing a fair amount of data (typical block size is 64/128/256 MB depending on your configuration) in a sequential nature (it will read the file / block in its entirety from start to end.
It is also unlikely that another mapper instance running on the same machine will want to process that data block again any time in the immediate future, more so that multiple mapper instances will also be processing data alongside this mapper in any one TaskTracker (hopefully with a fair few being 'local' to actually physical location of the data, i.e. a replica of the data block also exists on the same machine the mapper instance is running).
With all this in mind, caching the data read from HDFS is probably not going to gain you much - you'll most probably not get a cache hit on that data before another block is queried and will ultimately replace it in the cache.

Resources