Hadoop as Data Archive System

I am analyzing the possibility of using Hadoop (HDFS) as a data archival solution, since it offers linear scalability and a lower maintenance cost per terabyte.
Please let me know your recommendations and the set of parameters (I/O, memory, disk) that should be analyzed to evaluate Hadoop as a data archival system.
On a related query: while trying to upload a 500 MB file using the Hadoop shell as follows,
$ # Create a 500 MB file using dd
$ dd if=/dev/zero of=500MBFile.txt bs=524288000 count=1
$ hadoop fs -Ddfs.block.size=67108864 -copyFromLocal 500MBFile.txt /user/cloudera/
Please let me know why the input file is not getting split based on the block size (64 MB). This will be good to understand, since, as part of data archival, if we receive a 1 TB file, how will it be split and distributed across the cluster?
I've tried the exercise using a single-node Cloudera Hadoop setup with a replication factor of 1.
Thanks again for your great response.

You can use HDFS as an archiving/storage solution, though I doubt it is optimal. Specifically, it is not as highly available as, say, OpenStack Swift, and it is not well suited to storing small files.
At the same time, if HDFS is your choice, I would suggest building the cluster with storage-oriented nodes. I would describe them as:
a) Use large, slow SATA disks. Since the data is not going to be read/written constantly, desktop-grade disks might do, which would be a major saving.
b) Use minimal memory - I would suggest 4 GB. It will not add much cost, but still enables occasional MapReduce processing.
c) A single CPU will do.
Regarding copyFromLocal: yes, the file does get split according to the defined block size.
Distribution will be even across the cluster, taking into account the replication factor. HDFS will also try to place each block's replicas on more than one rack.
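To verify the split, you can ask the NameNode to list the blocks of the uploaded file. A minimal check, using the path from the copyFromLocal command above:
$ # path taken from the copyFromLocal command above
$ hadoop fsck /user/cloudera/500MBFile.txt -files -blocks -locations
With a 64 MB block size, the 500 MB file should show up as 8 blocks; on a single-node setup with replication factor 1 they all sit on the same DataNode, but the file is still stored block by block.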

You can load the file in .har (Hadoop Archive) format.
You can get more details here: Hadoop Archives
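As a rough sketch (the source and destination paths below are just placeholders), an archive is built with the hadoop archive tool, which runs a MapReduce job, and the result can then be browsed through the har:// scheme:
$ # placeholder paths: archive everything under /user/cloudera/input
$ hadoop archive -archiveName data.har -p /user/cloudera/input /user/cloudera/archives
$ hadoop fs -ls har:///user/cloudera/archives/data.har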

A few inputs:
Consider compression in your solution. It looks like you will be using text files, for which you can achieve around 80% compression.
Make sure you select a Hadoop-friendly (i.e. splittable) compression codec.
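For example (file and directory names below are placeholders), bzip2 output remains splittable in HDFS, so a text file can be compressed before it is archived:
$ # hugefile.txt and the target directory are placeholders; -k keeps the original file
$ bzip2 -k hugefile.txt
$ hadoop fs -copyFromLocal hugefile.txt.bz2 /user/cloudera/archive/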

Related

Is it possible to create/work with a non-parallelized file in Hadoop

We always talk about how much faster it will be if we use Hadoop to parallelize our data and programs.
I would like to know whether it is possible to keep a small file on one specific DataNode (not parallelized).
possible to keep a small file in one specific dataNode
HDFS will try to split any file into HDFS blocks. The DataNodes don't store the entire file, nor should you attempt to store it on a particular node. Let Hadoop manage data locality.
Your file will be replicated 3 times by default in Hadoop for fault tolerance anyway.
If you have small files (less than the HDFS block size, 64 or 128MB, depending on the Hadoop version), then you probably shouldn't be using Hadoop. If you need parallelized processing, start with multi-threading. If you actually need distributed processes, my recommendation nowadays would be Spark or Flink, not Hadoop (MapReduce).
If you want this, it seems like you want object storage, not block storage.
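If you are unsure what block size and replication factor your cluster actually applies, a quick check on a machine with an HDFS client (the key names shown are the Hadoop 2.x ones) is:
$ hdfs getconf -confKey dfs.blocksize
$ hdfs getconf -confKey dfs.replication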

Use HDFS instead of spark.local.dir

Trying to understand why Spark needs space on the local machine! Is there a way around it? I keep running into ‘No space left on device’. I understand that I can set ‘spark.local.dir’ to a comma-separated list, but is there a way to use HDFS instead?
I am trying to merge two HUGE datasets. On smaller datasets Spark is kicking MapReduce's butt, but I can’t claim victory until I prove it with these huge datasets. I am not using YARN. Also, our gateway nodes (aka edge nodes) won’t have a lot of free space.
Is there a way around this?
During a groupByKey operation, Spark just writes serialized partitions into tmpDir. These are plain files (see the ShuffledRDD internals, the serializer, and so on), and writing them to HDFS would be complicated enough that Spark does not do it.
Just set 'spark.local.dir' to a volume with free space. This data is needed only by the local machine; it is not distributed data (unlike HDFS).
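A minimal sketch of pointing the scratch space at larger local volumes (the mount points, class name, and jar below are placeholders):
$ # mount points, class, and jar are placeholders for your own job
$ spark-submit --conf spark.local.dir=/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp --class com.example.MergeJob merge-job.jar
The same property can also go into spark-defaults.conf; note that some cluster managers override it with their own local-directory settings.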

(HDFS) How to copy large data safely within a cluster?

I've got to make big sample data (say 1 TB) and have approximately 20 GB of text files.
So I tried to just copy them 50 times to make the data that big, but every time I ran the hadoop fs -cp command, some of my DataNodes died.
I heard that in UNIX, when deleting large data, one can use SHRINK to safely remove the data from disk. Is there something like that in Hadoop for copying large data?
In short, is there any way to copy large data safely within a Hadoop cluster?
Or do I have to modify some configuration files?
Try distcp. It runs a MapReduce job under the hood for copying data, allowing you to leverage the parallelism provided by Hadoop.
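A minimal sketch, with placeholder paths, of copying the 20 GB sample 50 times; each invocation launches its own MapReduce copy job:
$ # /user/me/... paths are placeholders
$ for i in $(seq 1 50); do hadoop distcp /user/me/sample-20gb /user/me/copies/run-$i; done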

Is the intention of Hadoop FS to keep data in RAM or on disk?

We are thinking about going with Hadoop in my company. From looking at the docs on the Internet I got the impression the idea of HDFS is to keep data in RAM to speed things up. Now our architect says that the main idea of HDFS is scalability. I'm fine with that. But then he also claims the main idea is to keep the data on the hard disk: HDFS is basically a scalable hard disk. My opinion is that backing HDFS by the hard disk is an option; the main idea, however, is to keep it in RAM. Who is right? I'm really confused now, and the point is crucial for my understanding of Hadoop, I would say.
Thanks, Oliver
Oliver, your architect is correct. Horizontal scalability is one of the biggest advantages of HDFS (and Hadoop in general). When you say Hadoop, it implies that you are dealing with very huge amounts of data, right? How are you going to put so much data in memory? (I am assuming that by "the idea of HDFS is to keep it in RAM to speed up things" you mean keeping the data stored in HDFS in RAM.)
HDFS's metadata, however, is kept in memory so that you can quickly locate the data stored in your HDFS. Remember, HDFS is not something physical. It is rather a virtual filesystem that lies on top of your native filesystem. So, when you say you are storing data into HDFS, it eventually gets stored in your native/local filesystem on your machine's disk, not in RAM.
Having said that, there are certain major differences in the way HDFS and the native FS behave, like the block size, which is very large compared to the local FS block size, and the replicated manner in which data is stored in HDFS (think of RAID, but at the software level).
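If you want to see this for yourself, a rough sketch (the directory below is a placeholder and the key name is the Hadoop 2.x one) is to ask for the DataNode's data directory and look for the block files HDFS keeps on the native filesystem:
$ hdfs getconf -confKey dfs.datanode.data.dir
$ # /data/1/dfs/dn is a placeholder; use the directory printed above
$ find /data/1/dfs/dn -name 'blk_*' | head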
So how does HDFS make things faster?
Hadoop is a distributed platform and HDFS a distributed store. When you put a file into HDFS it gets split into n small blocks (64 MB by default, but configurable). Then all the blocks of the file get stored across the machines of your Hadoop cluster. This allows us to read all of the blocks in parallel, thus reducing the total read time.
I would suggest you go through this link to get a proper understanding of HDFS:
http://hadoop.apache.org/docs/stable/hdfs_design.html
HTH

Experience with Hadoop?

Have any of you tried Hadoop? Can it be used without the distributed filesystem that goes with it, in a shared-nothing architecture? Would that make sense?
I'm also interested in any performance results you have...
Yes, you can use Hadoop on a local filesystem by using file URIs instead of hdfs URIs in various places. I think a lot of the examples that come with Hadoop do this.
This is probably fine if you just want to learn how Hadoop works and the basic map-reduce paradigm, but you will need multiple machines and a distributed filesystem to get the real benefits of the scalability inherent in the architecture.
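As a sketch of running on the local filesystem (the examples jar name and paths below are placeholders and vary by distribution), the bundled examples can be pointed at local data with file:// URIs:
$ # jar name and input/output paths are placeholders
$ hadoop jar hadoop-mapreduce-examples.jar wordcount file:///home/me/input file:///home/me/output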
Hadoop MapReduce can run on top of any number of file systems, or even more abstract data sources such as databases. In fact there are a couple of built-in classes for non-HDFS filesystem support, such as S3 and FTP. You could easily build your own input format as well by extending the basic InputFormat class.
Using HDFS brings certain advantages, however. The most potent advantage is that the MapReduce job scheduler will attempt to execute maps and reduces on the physical machines that are storing the records in need of processing. This brings a performance boost as data can be loaded straight from the local disk instead of transferred over the network, which depending on the connection may be orders of magnitude slower.
As Joe said, you can indeed use Hadoop without HDFS. However, throughput depends on the cluster's ability to do computation near where the data is stored. Using HDFS has 2 main benefits IMHO: 1) computation is spread more evenly across the cluster (reducing the amount of inter-node communication) and 2) the cluster as a whole is more resistant to failure due to data unavailability.
If your data is already partitioned or trivially partitionable, you may want to look into supplying your own partitioning function for your map-reduce task.
The best way to wrap your head around Hadoop is to download it and start exploring the included examples. Use a Linux box/VM and your setup will be much easier than on Mac or Windows. Once you feel comfortable with the samples and concepts, start to see how your problem space might map onto the framework.
A couple resources you might find useful for more info on Hadoop:
Hadoop Summit Videos and Presentations
Hadoop: The Definitive Guide: Rough Cuts Version - This is one of the few (only?) books available on Hadoop at this point. I'd say it's worth the price of the electronic download option even at this point (the book is ~40% complete).
Parallel/distributed computing = SPEED << Hadoop makes this really, really easy and cheap, since you can just use a bunch of commodity machines!!!
Over the years disk storage capacities have increased massively but the speeds at which you read the data have not kept up. The more data you have on one disk, the slower the seeks.
Hadoop is a clever variant of the divide-and-conquer approach to problem solving.
You essentially break the problem into smaller chunks and assign the chunks to several different computers to perform processing in parallel to speed things up rather than overloading one machine. Each machine processes its own subset of data and the result is combined in the end. Hadoop on a single node isn't going to give you the speed that matters.
To see the benefit of Hadoop, you should have a cluster with at least 4 - 8 commodity machines (depending on the size of your data) on the same rack.
You no longer need to be a super-genius parallel systems engineer to take advantage of distributed computing. Just know Hadoop with Hive and you're good to go.
Yes, Hadoop can very well be used without HDFS. HDFS is just the default storage for Hadoop. You can replace HDFS with other storage such as databases. HadoopDB is an augmentation of Hadoop that uses databases instead of HDFS as the data source. Google it; you will find it easily.
If you're just getting your feet wet, start out by downloading CDH4 and running it. You can easily install it into a local virtual machine and run in "pseudo-distributed mode", which closely mimics how it would run in a real cluster.
Yes, you can use the local file system by specifying file:// in the input file path etc., and this also works with small data sets. But the actual power of Hadoop is based on its distributed and sharing mechanism. Hadoop is used for processing huge amounts of data: that amount of data cannot be processed by a single local machine, or even if it can, it will take a lot of time to finish the job. Since your input file is in a shared location (HDFS), multiple mappers can read it simultaneously, which reduces the time to finish the job. In a nutshell, you can use it with the local file system, but to meet the business requirement you should use it with a shared file system.
Great theoretical answers above.
To change your Hadoop file system to the local file system, you can change it in the "core-site.xml" configuration file as shown below. For Hadoop versions 2.x.x:
<property>
  <name>fs.defaultFS</name>
  <value>file:///</value>
</property>
For Hadoop versions 1.x.x:
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
</property>
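After making that change, a quick sanity check is that the Hadoop shell now resolves paths against the local filesystem:
$ hadoop fs -ls /
This should list the root of your local disk rather than the HDFS root.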

Resources