Is the intention of Hadoop FS to keep data in RAM or on disk?

We are thinking about going with Hadoop at my company. From looking at the docs on the Internet I got the impression that the idea of HDFS is to keep data in RAM to speed things up. Now our architect says that the main idea of HDFS is scalability. I'm fine with that. But he also claims the main idea is to keep the data on the hard disk, i.e. that HDFS is basically a scalable hard disk. My opinion is that backing HDFS by the hard disk is an option; the main idea, however, is to keep it in RAM. Who is right? I'm really confused now, and the point is crucial for understanding Hadoop, I would say.
Thanks, Oliver

Oliver, your architect is correct. Horizontal scalability is one of the biggest advantages of HDFS (and Hadoop in general). When you say Hadoop, it implies that you are dealing with huge amounts of data, right? How are you going to put that much data in memory? (I am assuming that by "the idea of HDFS is to keep it in RAM to speed up things" you mean keeping the data stored in HDFS in RAM.)
However, HDFS's metadata is kept in memory so that you can quickly access the data stored in HDFS. Remember, HDFS is not something physical; it is a virtual filesystem that lies on top of your native filesystem. So when you store data in HDFS, it eventually gets stored in your native/local filesystem on your machines' disks, not in RAM.
Having said that, there are certain major differences in the way HDFS and a native FS behave, like the block size, which is very large compared to a local FS block size, and the replicated manner in which data is stored in HDFS (think of RAID, but at the software level).
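For example, assuming the hdfs command is on your PATH, you can check the cluster-wide defaults yourself (the configuration key is dfs.block.size on older Hadoop 1.x releases and dfs.blocksize on 2.x and later):
$ hdfs getconf -confKey dfs.blocksize    # default block size in bytes, e.g. 134217728 for 128 MB
$ hdfs getconf -confKey dfs.replication  # default replication factor, usually 3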
So how does HDFS make things faster?
Hadoop is a distributed platform and HDFS is a distributed store. When you put a file into HDFS it gets split into n blocks (64 MB each by default, but configurable). All the blocks of a file then get stored across the machines of your Hadoop cluster, which allows us to read all of the blocks in parallel and thus reduce the total reading time.
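As a quick illustration (the path and file name below are just placeholders), you can put a file and then ask HDFS where its blocks ended up:
$ hadoop fs -put bigfile.dat /user/oliver/bigfile.dat
$ hdfs fsck /user/oliver/bigfile.dat -files -blocks -locations
The fsck report lists each block of the file together with the DataNodes holding its replicas, which is exactly what lets the blocks be read in parallel.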
I would suggest you go through this link in order to get a proper understanding of HDFS:
http://hadoop.apache.org/docs/stable/hdfs_design.html
HTH

Related

Alluxio with/without HDFS

I have a cluster with HDFS as the under storage distributed file system, but I've just read about Alluxio, which is fast and flexible. So, my question is: should I use Alluxio with HDFS, or is Alluxio an alternative to HDFS? (I see on their site that the shared storage for the under storage file system can be a network file system (NFS), so I think HDFS is not required. Correct me if I'm mistaken.)
In which mode is performance better: HDFS with Alluxio, or Alluxio standalone (by standalone I mean used alone in the cluster, not locally)?
Reply from an Alluxio maintainer:
First of all, Alluxio is not a replacement for HDFS. Instead, it is a new abstraction layer on top of other distributed/cloud storage systems, including HDFS, S3, Azure Object Store and other possible choices. In your case, if your data is already in HDFS, you will perhaps still keep HDFS as the persistent data layer for Alluxio.
The typical scenarios where users put Alluxio in the picture and see significant benefits include:
Your physical data is not co-located with your compute, e.g. your big data engine is reading data from S3 or other object storage. In this case, by deploying Alluxio on the compute nodes, one can make Alluxio work as a filesystem-level cache to avoid fetching data across the network repeatedly. See http://www.alluxio.org/overview/remote-data-acceleration
You are managing multiple storage systems and want to expose a single data access layer to simplify management, e.g. one can "mount" multiple S3 buckets into one Alluxio deployment so they appear as different directories under the same namespace (a sketch follows below). See http://www.alluxio.org/overview/storage-unification
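As a rough sketch of that unification scenario (the bucket names and mount points are made up, and the exact credential options depend on your Alluxio version):
$ ./bin/alluxio fs mount /mnt/sales s3://sales-bucket/data
$ ./bin/alluxio fs mount /mnt/logs  s3://logs-bucket/2018
$ ./bin/alluxio fs ls /mnt
Both buckets now show up as directories under the single Alluxio namespace.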
Regarding your original performance question: the answer is, it depends. If your HDFS is remote from the compute, you can expect a good performance gain. I have also seen cases where HDFS is bottlenecked; there, Alluxio may also help reduce the load and provide a good SLA for certain mission-critical jobs.

Is it possible to create/work with a non-parallelized file in Hadoop

We always talk about how much faster things will be if we use Hadoop to parallelize our data and programs.
I would like to know: is it possible to keep a small file on one specific DataNode (not parallelized)?
possible to keep a small file in one specific dataNode
HDFS will try to split any file into HDFS blocks. The DataNodes don't store the entire file, nor should you attempt to store it on a particular one. Let Hadoop manage the data locality.
Your file will be replicated 3 times by default in Hadoop for fault tolerance anyway.
If you have small files (less than the HDFS block size, 64 or 128 MB depending on the Hadoop version), then you probably shouldn't be using Hadoop. If you need parallelized processing, start with multi-threading. If you actually need distributed processing, my recommendation nowadays would be Spark or Flink, not Hadoop (MapReduce).
If you want this, it sounds like you want object storage, not block storage.

HBase on Hadoop, data locality deep dive

I have read multiple articles about how HBase gains data locality, e.g. this link or the HBase: The Definitive Guide book.
I have understood that when re-writing an HFile, Hadoop writes the blocks on the same machine, which is actually the same RegionServer that ran the compaction and created the bigger file on Hadoop. Everything is well understood so far.
Questions:
Assuming a RegionServer has a region file (HFile) which is split on Hadoop into multiple blocks, e.g. A, B, C: does that mean all the blocks (A, B, C) will be written to the same RegionServer?
What happens if the HFile after compaction has 10 blocks (a huge file), but the RegionServer doesn't have storage for all of them? Does that mean we lose data locality, since those blocks will be written to other machines?
Thanks for the help.
HBase uses the HDFS API to write data to the distributed file system (HDFS). I know this will increase your doubt about data locality.
When a client writes data to HDFS using the HDFS API, it ensures that a copy of the data is written to the local DataNode (if applicable) and then goes on to replication.
Now I will answer your questions,
Yes. The HFile blocks written by a specific RegionServer (RS) reside on the local DataNode until they are moved for load balancing or recovery by the HMaster (locality is regained on the next major compaction). So the blocks A, B, C would be on the same RegionServer.
Yes, this may happen. But we can control it by configuring the region start and end keys for each region of the HBase table at creation time, which allows the data to be distributed evenly across the cluster (see the sketch below).
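As a hedged example (the table name, column family and split keys below are hypothetical), pre-splitting a table at creation time can be done from the HBase shell:
$ echo "create 'mytable', 'cf', SPLITS => ['g', 'm', 't']" | hbase shell
This creates four regions up front, so writes are spread across RegionServers from the start instead of all landing on a single region.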
Hope this helps.

Getting data into Hadoop

I come from a lot of SQL servers, so it can be a bit difficult to picture exactly what happens to data when it goes into Hadoop.
My understanding is that if you have a book in text format that could be around 200k or so... you simply copy the data into Hadoop and it becomes searchable. However, does this data become part of a block so that HDFS can be more optimal, or does it remain a 200k file in HDFS, hurting performance?
Also, is a block what is often called a tablet in Bigtable?
Thanks a lot for your help.
FlyMario
A file which is less than the block size of HDFS (default 64 megabytes) becomes part of a block, yes. But small files such as these might still hurt your performance in some cases, such as if you have a lot of these small files and you run a MapReduce job on them.
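You can see this for yourself with hdfs dfs -stat (the path below is just an example, and the format options may vary slightly by Hadoop version); %o prints the block size the file was written with and %b the actual number of bytes it occupies:
$ hdfs dfs -stat "block size: %o, length: %b, replication: %r" /user/flymario/book.txt
A 200k file will report the full block size for %o but only about 200k for %b, i.e. it does not waste a whole block of disk space (its cost is mainly NameNode metadata).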
Vanilla Hadoop has nothing to do with Bigtable, and HDFS blocks aren't really comparable with tablets. While Hadoop's HDFS blocks have no knowledge of the data they're holding, Bigtable tablets are data-aware.

Hadoop as Data Archive System

I am analyzing the possibility of using Hadoop (HDFS) as a data archival solution, which gives linear scalability and lower maintenance cost per terabyte.
Please let me know your recommendations and the set of parameters, like I/O, memory and disk, which have to be analyzed to evaluate Hadoop as a data archival system.
On a related query, while trying to upload a 500 MB file using the Hadoop shell as,
$ # Create a 500 MB test file using dd
$ dd if=/dev/zero of=500MBFile.txt bs=524288000 count=1
$ # Upload it to HDFS with a 64 MB (67108864-byte) block size
$ hadoop fs -Ddfs.block.size=67108864 -copyFromLocal 500MBFile.txt /user/cloudera/
please let me know why the input file is not getting split based on the block size (64 MB). This will be good to understand, since as part of data archival, if we get a 1 TB file, how will it be split and distributed across the cluster?
I've tried the exercise using a single-node Cloudera Hadoop setup, and the replication factor is 1.
Thanks again for your great response.
You can use HDFS as an archiving/storage solution, though I doubt it is optimal. Specifically, it is not as highly available as, say, OpenStack Swift, and it is not suited for storing small files.
At the same time, if HDFS is your choice, I would suggest building the cluster with storage-oriented nodes. I would describe them as:
a) Use large and slow SATA disks. Since data is not going to be read/written constantly, desktop-grade disks might do, and that will be a major saving.
b) Use minimal memory; I would suggest 4 GB. It will not add much cost, but still enables occasional MR processing.
c) A single CPU will do.
Regarding copyFromLocal: yes, the file does get split according to the defined block size.
Distribution across the cluster will be even, taking into account the replication factor. HDFS will also try to put each block on more than one rack.
You can load the file in .har (Hadoop Archive) format.
You can get more details here: Hadoop Archives
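A minimal sketch of creating and reading a HAR (the directory and archive names are placeholders):
$ hadoop archive -archiveName data.har -p /user/cloudera input_dir /user/cloudera/archives
$ hadoop fs -ls har:///user/cloudera/archives/data.har
This packs many small files into a single archive while keeping them addressable through the har:// scheme.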
A few inputs:
Consider compression in your solution. It looks like you will be using text files; you can achieve around 80% compression.
Make sure you select a Hadoop-friendly (i.e. splittable) compression codec (see the example below).
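For instance (the file name is made up), bzip2 is one of the few codecs that remains splittable in Hadoop, so a simple pre-compress-then-upload flow could look like:
$ bzip2 -k big_archive.txt                       # produces big_archive.txt.bz2 and keeps the original
$ hadoop fs -put big_archive.txt.bz2 /user/cloudera/archive/
Gzip, by contrast, is not splittable, so a 1 TB gzipped file would end up being processed by a single mapper.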
