HDFS and redundancy - hadoop

I'm planning a data processing pipeline. My scenario is this:
A user uploads data to a server
This data should be distributed to one (and only one) node in my cluster. There is no distributed computing, just picking the node that currently has the least to do.
The data processing pipeline gets data from some kind of distributed job engine. Here, finally, is my question: many job engines rely on HDFS to work on the data. But since this data is processed on one node only, I'd rather avoid distributing it. My understanding is that HDFS keeps the data redundant, but I could not find any information on whether this means all data on HDFS is available on all nodes, or whether the data mostly stays on the node where it is processed (locality).
For IO reasons, it would be a concern in my usage scenario if data on HDFS were completely redundant.

You can go with Hadoop (MapReduce + HDFS) to solve your problem.
You can tell HDFS to store exactly as many copies as you want; see the dfs.replication property below. Set this value to 1 if you want only one copy.
conf/hdfs-site.xml - On master and all slave machines
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified at create time.
  </description>
</property>
It is not necessary for HDFS to copy data to each and every node. More info.
Hadoop works on the principle "move code to data". Since moving code (usually a few MBs) demands far less network bandwidth than moving data (GBs or TBs), you don't need to worry about data locality or network bandwidth; Hadoop takes care of it.
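To put rough numbers on that principle, here is a back-of-envelope sketch (the bandwidth and sizes are made-up assumptions, purely for illustration):

```python
# Illustrating "move code to data": shipping a small job jar to the nodes
# is far cheaper than pulling a large dataset over the network to the code.
# Bandwidth and sizes below are assumptions, not measured values.

def transfer_seconds(size_mb, bandwidth_mb_per_s=100):
    """Time to move `size_mb` over a link of `bandwidth_mb_per_s` (illustrative)."""
    return size_mb / bandwidth_mb_per_s

code_mb = 10                  # a job jar of a few MB (assumed)
data_mb = 2 * 1024 * 1024     # a 2 TB dataset expressed in MB

move_code = transfer_seconds(code_mb)
move_data = transfer_seconds(data_mb)

print(f"ship code: {move_code:.1f}s, ship data: {move_data / 3600:.1f}h")
```

Even with these toy numbers, shipping the code takes a fraction of a second while shipping the data takes hours, which is why the scheduler prefers to run tasks where the blocks already are.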

Related

Does uploading a file to HDFS automatically store the files in a distributed manner?

I just started learning Hadoop, and I am a little confused about how the data is stored in a distributed manner. I have an MPI background. With MPI, we typically have a master processor that sends out data to various other processors. This is done explicitly by the programmer.
With Hadoop, you have a Hadoop Distributed File System (HDFS). So when you put some file from your local server into HDFS, does HDFS automatically store this file in a distributed manner without anything needed to be done by the programmer? The name, HDFS, seems to imply this, but I just wanted to verify.
Yes, it does.
The file is uploaded, and the NameNode coordinates its replication to the DataNodes (where it is stored) based on the replication factor (usually 3).
In addition, the NameNode has a job that looks for under-replicated files or blocks and will duplicate them to maintain the replication factor. See HDFS Architecture - Data Replication for more information.
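The monitoring idea can be sketched in a few lines (this is an illustration of the concept only, not the actual NameNode code; block ids and node names are invented):

```python
# Concept sketch of the under-replication check: compare each block's live
# replica count against the target factor and report any shortfall that
# would need to be re-replicated.

TARGET_REPLICATION = 3

def find_under_replicated(block_locations, target=TARGET_REPLICATION):
    """Map block id -> how many additional copies are needed."""
    return {
        block: target - len(nodes)
        for block, nodes in block_locations.items()
        if len(nodes) < target
    }

blocks = {
    "blk_001": ["dn1", "dn2", "dn3"],   # fully replicated, nothing to do
    "blk_002": ["dn1"],                 # two replicas lost, needs 2 more copies
}
print(find_under_replicated(blocks))    # {'blk_002': 2}
```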

Need of maintaining replication factor on datanodes

Please pardon if this question has come up earlier as I'm not able to find any related question for this.
1) I want to know why it is important to maintain the same replication factor (or, for that matter, any configuration) across the datanodes and namenodes in the cluster.
2) When we upload any file to HDFS, isn't it the namenode which manages the storage?
3) Wouldn't maintaining the configuration only on the namenodes suffice?
4) What are the implications of having the configuration different across namenode and datanodes?
Any help is much appreciated. Thank you! :)
I will try to answer your question taking replication as an example.
A few things to keep in mind:
Data always resides on datanodes; the Namenode never deals with or stores data, it only keeps metadata about the data.
The replication factor is configurable, and you can change it per file: for example, file1 may have a replication factor of 2 while file2 has a replication factor of, say, 3. In a similar way, some other properties can also be configured at execution time.
2) When we upload any file to HDFS, isn't it the namenode which manages the storage?
I am not sure what exactly you mean by the namenode managing the storage; here is how a file upload to hdfs gets executed:
1) Client sends a request to Namenode for file upload to hdfs
2) Namenode, based on the configuration (if not explicitly specified by the client application), calculates the number of blocks the data will be broken into.
3) Namenode also decides which Datanodes will store the blocks, based on the replication factor specified in the configuration (if not explicitly specified by the client application).
4) Namenode sends the information calculated in steps #2 and #3 to the client.
5) The client application breaks the file into blocks and writes each block to 'a' Datanode, say DN1.
6) Now DN1 is responsible for replicating the received blocks to the other Datanodes chosen by the Namenode in #3; it initiates replication when the Namenode instructs it to.
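The steps above can be sketched as follows (all names, the block size default, and the round-robin placement policy here are simplifications for illustration; this is not the real HDFS client API):

```python
# Toy walk-through of the upload flow: the "namenode" computes block count
# and placement, and the plan is handed back to the client, which then
# writes each block to the first datanode in its list.

BLOCK_SIZE = 128 * 1024 * 1024  # common default block size (assumption)

def plan_blocks(file_size, block_size=BLOCK_SIZE):
    """Step 2: number of blocks the file is broken into (ceiling division)."""
    return -(-file_size // block_size)

def choose_datanodes(num_blocks, datanodes, replication=3):
    """Step 3: pick `replication` nodes per block (toy round-robin policy)."""
    plan = []
    for b in range(num_blocks):
        start = b % len(datanodes)
        picked = [datanodes[(start + i) % len(datanodes)] for i in range(replication)]
        plan.append(picked)
    return plan

nodes = ["dn1", "dn2", "dn3", "dn4"]
n = plan_blocks(300 * 1024 * 1024)   # a 300 MB file -> 3 blocks
print(n, choose_datanodes(n, nodes))
```

The real NameNode uses rack awareness and node load rather than round-robin, but the shape of the exchange (count blocks, assign nodes, return the plan) is the same.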
For your questions #3 and #4, it is important to understand that any distributed application requires a set of configurations available on each node so the nodes can interact with each other and perform their designated tasks as expected. If every node chose to have its own configuration, what would be the basis of coordination? If DN1 has a replication factor of 5 while DN2 has 2, how would data actually be replicated?
Update start
hdfs-site.xml contains lots of other config specifications as well, for the namenode, datanode and secondary namenode, plus some client- and hdfs-specific settings, not just the replication factor.
Now imagine having a 50-node cluster: would you like to go and configure each node, or simply copy a pre-configured file?
Update end
If you keep all configuration in one location, each node will need to connect to that shared resource to load the configuration every time it has to perform an action. This adds latency, on top of the consistency/synchronization issues if any config property is changed.
Hope this helps.
Hadoop is designed to deal with large datasets. It's not a good idea to store a large dataset on a single machine because if your storage system or hard disk crashes, you may lose all of your data.
Before Hadoop, people used traditional systems to store large amounts of data, but traditional systems were very costly. There were also challenges in analyzing large datasets with them, as reading the data was a time-consuming process. With these things in mind, the Hadoop framework was designed.
In the hadoop framework, when you load large amounts of data, it splits the data into small chunks, known as blocks. These blocks are used to place the data onto datanodes in a distributed cluster, and they are also used during analysis of the data.
The reason behind splitting the data is parallel processing and distributed storage (i.e., you can store your data on multiple machines, and when you want to analyze it, you can do so in parallel).
Now Coming to your questions:
Reason: Hadoop is a framework which allows distributed storage and computing. In other words, you can store the data on multiple machines. It has replication functionality, which means multiple copies (based on the replication factor) of the same data are kept.
Ans1: Hadoop is designed to run on commodity hardware, and failures are common on commodity hardware. Suppose you store the data on a single machine: when the machine crashes, you lose your entire data. But in a hadoop cluster you can recover the data from another replica (if you have a replication factor greater than 1), as hadoop doesn't store a replicated copy of the data on the same machine where the original copy resides. These things are handled by hadoop itself.
Ans2: When you upload a file to HDFS, your actual data goes to the datanodes, and the NameNode keeps the metadata about your data. NameNode metadata contains things like block names, block locations, the filename, and the directory location of the file.
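As a toy illustration of that split (the data structures here are invented for illustration, not the real NameNode internals):

```python
# The namenode holds only metadata (file -> blocks -> locations); the block
# bytes themselves live on the datanodes. Names and block ids are made up.

namenode_metadata = {
    "/user/data/file1.txt": {
        "blocks": {
            "blk_001": ["dn1", "dn3", "dn4"],   # block id -> datanode replicas
            "blk_002": ["dn2", "dn3", "dn5"],
        },
        "replication": 3,
    }
}

def locate(path):
    """What the namenode hands a reader: block ids and where the replicas live."""
    return namenode_metadata[path]["blocks"]

print(locate("/user/data/file1.txt"))
```

The client then reads the actual bytes directly from the datanodes; the namenode never sits in the data path.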
Ans3: You need to maintain the entire configuration related to your hadoop cluster. Maintaining one configuration file is not sufficient, and you may run into other problems.
Ans4: NameNode configuration properties relate to NameNode functionality, such as namespace services, the metadata location, and the RPC address that handles all client requests. DataNode configuration properties relate to services performed by the DataNode, such as storage balancing among the DataNode's volumes, available disk space, and the DataNode server address and port for data transfer.
Please check this link to understand more about the different configuration properties.
Please provide more clarification about questions 3 and 4 if there is something more you want to know.

hadoop's poor scheduling of tasks

I am running some map reduce tasks on hadoop. The mapper is used to generate data and hence does not depend upon the hdfs block placement. To test my system I am using 2 nodes and one master node. I am doing my testing on hadoop-2.0 with yarn.
There is something I find very uncomfortable with hadoop. I have configured it to run 8 map tasks. Unfortunately, hadoop launches all 8 map tasks on one node while the other node sits almost idle. There are 4 reducers, and it does not balance these reducers either. This really results in poor performance when it happens.
I have these properties set in mapred-site.xml on both the job tracker and the task tracker:
<property>
  <name>mapreduce.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
Can someone explain whether this problem can be solved, or why such a problem exists with hadoop?
Don't think of mappers/reducers as one-to-one with servers. It sounds like your system knows the load is so low that there is no need to launch reducers across the cluster. It is trying to avoid the network overhead of transferring files from the master to the slave nodes.
Think of the number of mappers and reducers as how many concurrent threads you will allow your cluster to run. This is important when determining how much memory to allocate to each mapper/reducer.
To force an even distribution, you could try allocating enough memory to each mapper/reducer that it requires a whole node. For example: 4 nodes, 8 mappers; force each mapper to take 50% of the RAM on each node. I'm not sure this will work as expected, but Hadoop's load balancing is good in theory, even if it might not seem that way for small-data situations.
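The back-of-envelope arithmetic behind that suggestion (the RAM figure is an assumption for illustration):

```python
# If each mapper is sized to 50% of a node's RAM, only 2 mappers fit per
# node, so 8 mappers are forced to spread across at least 4 nodes.

def mappers_per_node(node_ram_gb, mapper_ram_gb):
    """How many mappers fit on one node at a given per-mapper allocation."""
    return node_ram_gb // mapper_ram_gb

node_ram = 16                     # GB of RAM per node (assumed)
mapper_ram = node_ram // 2        # give each mapper 50% of a node's RAM

fit = mappers_per_node(node_ram, mapper_ram)
print(fit, "mappers per node ->", 8 // fit, "nodes needed for 8 mappers")
```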

When I store files in HDFS, will they be replicated?

I am new to Hadoop.
When I store Excel files using the hadoop fs -put command, they are stored in HDFS.
The replication factor is 3.
My question is: does it take 3 copies and store them on 3 different nodes?
Here is a comic on how HDFS works:
https://docs.google.com/file/d/0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1/edit?pli=1
Does it take 3 copies and store them on 3 nodes each?
The answer is: no.
Replication is done by pipelining: part of the file is copied to datanode1, then from datanode1 to datanode2, and from datanode2 to datanode3.
See Replication Pipelining here:
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Replication+Pipelining
Your HDFS Client (hadoop fs in this case) will be given the block names and datanode locations (the first being the closest location if the NameNode can determine this from the rack awareness script) of where to store these files by the NameNode.
The client then copies the blocks to the closest datanode. That datanode is then responsible for copying the block to a second datanode (preferably on another rack), and finally the second copies it to a third (on the same rack as the second).
So your client will only copy data to one of the data nodes, and the framework will take care of the replication between datanodes.
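The default placement the paragraphs above describe can be sketched like this (a toy helper with invented node/rack names, not the real HDFS block placement code):

```python
# Default HDFS placement sketch: replica 1 on the writer's node, replica 2
# on a node in a different rack, replica 3 on the same rack as replica 2.

def place_replicas(writer, topology):
    """topology: node -> rack. Return three nodes per the default policy."""
    local_rack = topology[writer]
    remote = [n for n, r in topology.items() if r != local_rack]
    second = remote[0]                       # a node on a different rack
    third = next(n for n, r in topology.items()
                 if r == topology[second] and n != second)
    return [writer, second, third]

racks = {"dn1": "rackA", "dn2": "rackA", "dn3": "rackB", "dn4": "rackB"}
print(place_replicas("dn1", racks))          # ['dn1', 'dn3', 'dn4']
```

This balances write bandwidth (only one cross-rack hop in the pipeline) against fault tolerance (the data survives the loss of a whole rack).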
It will store the original file as one block (or more, in the case of large files). These blocks will be replicated to two other nodes.
Edit: My answer applies to Hadoop 2.2.0. I have no experience with prior versions.
Yes, it will be replicated on 3 nodes (at most 3 nodes).
The Hadoop client is going to break the data file into smaller "blocks" and place those blocks on different machines throughout the cluster. The more blocks you have, the more machines will be able to work on this data in parallel. At the same time, these machines may be prone to failure, so it is safer to ensure that every block of data is on multiple machines at once to avoid data loss.
So each block will be replicated in the cluster as its loaded. The standard setting for Hadoop is to have (3) copies of each block in the cluster. This can be configured with the dfs.replication parameter in the file hdfs-site.xml.
And replicating data is not a drawback of Hadoop at all, in fact it is an integral part of what makes Hadoop effective. Not only does it provide you with a good degree of fault tolerance, but it also helps in running your map tasks close to the data to avoid putting extra load on the network (read about data locality).
Yes, it makes n (replication factor) copies in hdfs.
Use this command to find the location of a file: which rack it is stored on and the block names on each rack:
hadoop fsck /path/to/your/directory -files -blocks -locations -racks
Use this command to load data into hdfs with a specific replication factor:
hadoop fs -Ddfs.replication=1 -put big.file /tmp/test1.file
With -Ddfs.replication=1 you define the number of replica copies created while loading data into hdfs.

HBase: How does replication work?

I'm currently evaluating HBase as a datastore, but one question was left unanswered: HBase stores many copies of the same object on many nodes (aka replication). As HBase features so-called strong consistency (in contrast to eventual consistency), it guarantees that every replica returns the same value when read.
As I understand the HBase concept, when reading values, the HBase master is first queried for a RegionServer providing the data (there must be more than one). Then I can issue read and write requests without intervention of the master. How can replication work then?
How does HBase provide consistency?
How do write operations internally work?
Do write operations block until all replicas are written (=> synchronous replication)? If yes, who manages this transfer?
How does HDFS come into the game?
I have already read the BigTable-Paper and searched the docs, but I found no further information on the architecture of HBase.
Thanks!
HBase does not do any replication in the way that you are thinking. It is built on top of HDFS, which provides replication for the data blocks that make up the HBase tables. However, only one regionserver ever serves or writes data for any given row.
Usually regionservers are colocated with datanodes. All data writes in HDFS go to the local node first if possible, then to a node on another rack, and then to another node on that second rack (given a replication factor of 3 in HDFS). So, a regionserver will eventually end up with all of its data served from the local server.
As for blocking: the only blocking is until the WAL (write-ahead log) is flushed to disk. This guarantees that no data is lost, as the log can always be replayed. Note that older versions of hbase did not have this worked out, because HDFS did not support a durable append operation until recently. We are in a strange state for the moment, as there is no official Apache release of Hadoop that supports both append and HBase. In the meantime, you can either apply the append patch yourself or use the Cloudera distribution (recommended).
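The WAL idea the answer leans on can be sketched as follows (a toy in-memory key-value log, not HBase's actual HLog implementation):

```python
# Write-ahead-log concept: a write is applied to memory only after it has
# been appended to the durable log, so the in-memory state can always be
# rebuilt by replaying the log after a crash.

class ToyWAL:
    def __init__(self):
        self.log = []        # stands in for the durable, flushed log file
        self.state = {}      # in-memory table (lost on a crash)

    def put(self, key, value):
        self.log.append((key, value))   # append to the log first...
        self.state[key] = value         # ...then apply to memory

    def recover(self):
        """Rebuild the in-memory state purely from the log, as after a crash."""
        self.state = {}
        for key, value in self.log:
            self.state[key] = value
        return self.state

wal = ToyWAL()
wal.put("row1", "a")
wal.put("row1", "b")
wal.state = {}                 # simulate losing the in-memory state
print(wal.recover())           # {'row1': 'b'}
```

A real WAL also needs an fsync on append and log truncation after flushes, but the recovery-by-replay property shown here is the core of the durability guarantee.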
HBase does have a related replication feature that will allow you to replicate data from one cluster to another.
