How does the copyFromLocal command for Hadoop DFS work?

I'm a little confused about how the Hadoop Distributed File System is set up and how my particular setup affects it. I used this guide to set it up http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ using two virtual machines in VirtualBox, and have run the example (just a simple word count with a txt file as input). So far, I know that the datanode manages and retrieves the files on its node, while the tasktracker analyzes the data.
1) When you use the command -copyFromLocal, are you copying files/input to the HDFS? Does Hadoop know how to divide the information between the slaves/master, and how does it do it?
2) In the configuration outlined in the guide linked above, are there technically two slaves (the master acts as both the master and a slave)? Is this common or is the master machine usually only given jobtracker/namenode tasks?

There are a lot of questions asked here.
Question 2)
There are two machines
These machines are configured for HDFS and Map-Reduce.
HDFS configuration requires a Namenode (master) and Datanodes (slaves)
Map-Reduce requires a Jobtracker (master) and Tasktrackers (slaves)
Only one Namenode and one Jobtracker are configured, but you can run Datanode and Tasktracker services on both machines. It is not the machine that acts as master or slave; it is the services. You can install slave services on machines that also host master services, which is fine for a simple development setup. In large-scale deployments, the master services get dedicated machines.
Question 1 Part 2)
It is HDFS's job to split the file into blocks and store them on multiple datanodes in a replicated manner. You don't have to worry about it.
Question 1 Part 1)
Hadoop file operations are patterned after typical Unix file operations: ls, put, etc.
hadoop fs -put localfile /data/somefile --> will copy a local file to HDFS at path /data/somefile
With the put option you can also read from standard input and write to an HDFS file
copyFromLocal is similar to put, except that its source is restricted to the local file system
See: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html#copyFromLocal
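If you need to do the same copy programmatically, the Hadoop Java FileSystem API exposes an equivalent call. Below is a minimal sketch, assuming core-site.xml/hdfs-site.xml are on the classpath; the paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the configuration files on the classpath,
        // so the client knows which NameNode to talk to.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hadoop fs -copyFromLocal /tmp/localfile /data/somefile
        fs.copyFromLocalFile(new Path("/tmp/localfile"), new Path("/data/somefile"));

        fs.close();
    }
}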

1)
The client connects to the name node to register a new file in HDFS.
The name node creates some metadata about the file (either using the default block size, or a configured value for the file)
For each block of data to be written, the client queries the name node for a block ID and list of destination datanodes to write the data to. Data is then written to each of the datanodes.
There is some more information in the Javadoc for org.apache.hadoop.hdfs.DFSClient.DFSOutputStream
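To see that flow from the client side, here is a minimal sketch using the Java FileSystem API; the path, replication factor, and block size are hypothetical overrides of the cluster defaults, shown only to make the metadata step visible.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() registers the new file with the NameNode; block size and
        // replication here are illustrative per-file overrides.
        FSDataOutputStream out = fs.create(
                new Path("/data/example.txt"),
                true,                 // overwrite if it exists
                4096,                 // client buffer size
                (short) 3,            // replication factor
                64L * 1024 * 1024);   // block size in bytes

        // The underlying DFSOutputStream streams these bytes to the pipeline
        // of DataNodes returned by the NameNode for each block.
        out.writeBytes("hello hdfs\n");
        out.close();
        fs.close();
    }
}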
2) Some production systems will be configured to make the master its own dedicated node (allowing the maximum possible memory allocation, and to avoid CPU contention), but if you have a smaller cluster, then a node which contains a name node and data node is acceptable

Related

Wrong IP mapping on some data nodes in hadoop

I have a Hadoop setup on 7 nodes, configured with local hostnames via /etc/hosts.
It looks like this
1.2.3.4 hadoop-master
1.2.3.5 hadoop-slave-1
1.2.3.6 hadoop-slave-2
1.2.3.7 hadoop-slave-3
1.2.3.8 hadoop-slave-4
1.2.3.9 hadoop-slave-5
1.2.3.10 hadoop-slave-6
Now the problem is that on some nodes the mapping for hadoop-slave-1 is wrong, that is, some nodes have hadoop-slave-1 mapped to 1.2.3.12 instead of 1.2.3.5.
The namenode has the correct mapping, so the data nodes show up fine in the namenode UI.
The question is: is it safe to just correct the /etc/hosts file and restart the services?
I think it can corrupt some specific blocks related to the hadoop-slave-1 node.
I can think of 2 ways to fix this:
Fix the /etc/hosts file on the affected nodes and restart the services. But I am not sure whether this could corrupt blocks. Is that assumption accurate?
Temporarily remove the single server hadoop-slave-1 from the cluster, re-balance the Hadoop cluster to distribute all the data across the remaining 6 nodes, then add the server back and re-balance the data across 7 nodes again.
The problem with this is that the data in the cluster is pretty big, re-balancing it would be a heavy job, and the pressure on the name node server could cause heap issues.
Is there any other solution in this situation?
Also, which tool or utility do you suggest for replicating data to another Hadoop cluster?
Help much appreciated!!
In general, using /etc/hosts is discouraged if you have a functional DNS server (which most routers provide).
For example, in my environment, I can ping namenode.lan
I think option 2 is the safest choice. The HDFS balancer (hdfs balancer) works fine.
and could cause heap issue
Then stop the namenode, increase the heap, and start it back up. While you're at it, set up NameNode HA so you have no downtime.
Note: master/slave hostnames are really not descriptive. HDFS, YARN, Hive, HBase, and Spark each have server-client architectures with their own master services, and those should not all be located on one machine.

Can I run Spark with a segmented file on each slave node?

Imagine I have two slaves and one master. Previously I copied the same data onto all slave nodes.
JavaPairRDD<IntWritable, VectorWritable> seqVectors =
        sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);
Here inputPath is not an HDFS path but a local path that each slave node has access to. Now I am considering a situation where each slave has only part of the data, and I want to use the same code without installing/working with HDFS. The problem is that after running the same code, the program runs without any error but does not produce any result, because:
The master has no data in the "inputPath".
The slaves have partial data in the "inputPath", but the master did not distribute any of its data to them to spread the workload.
My question is: how can I run my program in this new situation without any third-party software?
You cannot. If you want to run Spark
without installing/working with HDFS
(or other distributed storage), you have to provide a full copy of the data on each node, including the driver. Obviously that is not very useful in practice.
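For completeness, a minimal sketch of the same kind of read once the data lives on shared storage such as HDFS. The namenode host/port and path are hypothetical, and the original VectorWritable value type is swapped for Text to keep the sketch self-contained.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadFromSharedStorage {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("seq-read");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The path must be visible to every executor; an HDFS URI satisfies
        // that, unlike a per-machine local path.
        JavaPairRDD<IntWritable, Text> seqVectors = sc.sequenceFile(
                "hdfs://hadoop-master:9000/data/vectors.seq",
                IntWritable.class, Text.class);

        System.out.println(seqVectors.count());
        sc.stop();
    }
}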

What is meant by 'local file system'?

I'm currently reading about Hadoop and I came across this, which has puzzled me (please bear in mind that I am a complete novice when it comes to Hadoop):
Use the Hadoop get command to copy a file from HDFS to your local file
system:
$ hadoop hdfs dfs -get file_name /user/login_user_name
What is a local file system? I understand that HDFS partitions a file into different blocks throughout the cluster (but I know there's more to it than that). My understanding of the above command is that I can copy a file from the cluster to my personal (i.e. local) computer? Or is that completely wrong? I'm just not entirely sure what is meant by a local file system.
Local FS means your Linux or Windows file system, which is not part of the DFS.
Your understanding is correct: using -get you copy a file from HDFS to the local FS, and you cannot use both hadoop and hdfs in the same command. The command should be like below:
hdfs dfs -get file_name local_path or hadoop fs -get file_name local_path
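If you prefer to do the same thing from Java rather than the shell, the FileSystem API has copyToLocalFile; a minimal sketch mirroring the commands above, with hypothetical paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetFromHdfs {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hdfs dfs -get /user/login_user_name/file_name /tmp/file_name
        fs.copyToLocalFile(new Path("/user/login_user_name/file_name"),
                           new Path("/tmp/file_name"));
        fs.close();
    }
}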
Just as a file system can be divided into different drives, the Hadoop file system can be thought of as a separate file system layered on top of the Linux file system.
Your local file system is the file system on which you have installed Hadoop. Your machine acts as "local" in this case when copying a file from your machine into Hadoop.
You might want to look at: HDFS vs LFS
THINK of a cluster node (server) as having to fulfill 2 needs:
the need to store its own operating system, application and user data-related files; and
the need to store its portion of sharded or "distributed" cluster data files.
In each cluster data node then there needs to be 2 independent file systems:
the LOCAL ("non-distributed") file system:
stores the OS and all OS-related ancillary ("helper") files;
stores the binary files which form the applications which run on the server;
stores additional data files, but these exist as simple files which are NOT sharded/replicated/distributed in the server's "cluster data" disks;
typically comprised of numerous partitions - entire formatted portions of a single disk or multiple disks;
typically also running LVM in order to ensure "expandability" of these partitions containing critical OS-related code which cannot be permitted to saturate or the server will suffer catastrophic (unrecoverable) failure.
AND
the DISTRIBUTED file system:
stores only the sharded, replicated portions of what are actually massive data files "distributed" across all the other data drives of all the other data nodes in the cluster
typically comprised of at least 3 identical disks, all "raw" - unformatted, with NO RAID of any kind and NO LVM of any kind, because the cluster software (installed on the "local" file system) is actually responsible for its OWN replication and fault-tolerance, so that RAID and LVM would actually be REDUNDANT, and therefore cause unwanted LATENCY in the entire cluster performance.
LOCAL <==> OS and application and data and user-related files specific or "local" to the specific server's operation itself;
DISTRIBUTED <==> sharded/replicated data; capable of being concurrently processed by all the resources in all the servers in the cluster.
A file can START in a server's LOCAL file system where it is a single little "mortal" file - unsharded, unreplicated, undistributed; if you were to delete this one copy, the file is gone gone GONE...
... but if you first MOVE that file to the cluster's DISTRIBUTED file system, where it becomes sharded, replicated and distributed across at least 3 different drives likely on 3 different servers which all participate in the cluster, so that now if a copy of this file on one of these drives were to be DELETED, the cluster itself would still contain 2 MORE copies of the same file (or shard); and WHERE in the local system your little mortal file could only be processed by the one server and its resources (CPUs + RAM)...
... once that file is moved to the CLUSTER, now it's sharded into myriad smaller pieces across at least 3 different servers (and quite possibly many many more), and that file can have its little shards all processed concurrently by ALL the resources (CPUs & RAM) of ALL the servers participating in the cluster.
And that is the difference between the LOCAL file system and the DISTRIBUTED file system operating on each server, and that is the secret to the power of cluster computing :-) !...
Hope this offers a clearer picture of the difference between these two often-confusing concepts!
-Mark from North Aurora

How to put the reduce partitions onto designated machines in a Hadoop cluster?

For example:
reduce results: part-00000, part-00001 ... part-00008,
the cluster has 3 datanodes and I want to
put the part-00000, part-00001 and part-00002 to the slave0
put the part-00003, part-00004 and part-00005 to the slave1
put the part-00006, part-00007 and part-00008 to the slave2
How can I do that?
It doesn't work like that. A file in HDFS is not stored on any specific datanode. Each file is composed of blocks and each block is replicated to multiple nodes (3 by default). So each file is actually stored across different nodes, since the blocks that compose it are stored on different nodes.
Quoting the official documentation, which I advise you to read:
HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Seeing the partition tag in your question, it may be worth stating that the Partitioner defines in which partition (not datanode) each key will end up. For example, knowing that you have 9 reduce tasks (9 partitions), you may wish to split the workload of those tasks evenly. To do that you can define that, e.g., keys starting with the letter "s" should be sent to partition 0 and keys starting with "a" or "b" to partition 1, and so on (just a contrived example to illustrate what a partitioner does).
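To make that concrete, here is a minimal sketch of a custom Partitioner along the lines of the example above; the class name and routing rules are illustrative, not something from your job.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative custom Partitioner: routes keys to reduce partitions
// (i.e. to the part-0000N output files) by their first letter.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        char first = s.isEmpty() ? '\0' : Character.toLowerCase(s.charAt(0));
        if (first == 's') {
            return 0;                      // "s..." keys -> part-00000
        }
        if (first == 'a' || first == 'b') {
            return 1 % numPartitions;      // "a..."/"b..." keys -> part-00001
        }
        // Spread everything else over the remaining partitions.
        return (s.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

You would register it with job.setPartitionerClass(FirstLetterPartitioner.class) and pick the number of partitions with job.setNumReduceTasks(9); where each part-0000N file physically lands is still decided by HDFS block placement, not by the partitioner.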

Understanding file handling in hadoop

I am new to the Hadoop ecosystem, with only a basic idea of it. Please assist with the following queries to start with:
If the file I am trying to copy into HDFS is very big and cannot be accommodated by the available commodity hardware in my Hadoop cluster, what can be done? Will the file wait until space becomes free, or will there be an error?
How can I predict well in advance that the above scenario will occur in a Hadoop production environment where we continuously receive files from outside sources?
How do I add a new node to a live HDFS cluster? There are many methods, but I want to know which files I need to alter.
How many blocks does a node have? Suppose a node is a machine with storage (HDD: 500 GB), RAM (1 GB) and a dual-core processor. In this scenario is it 500 GB / 64 MB, assuming each block is configured as 64 MB?
If I copyFromLocal a 1TB file into HDFS, which portion of the file will be placed in which block in which node? How can I know this?
How can I find which record/row of the input file is available in which file of the multiple files split by Hadoop?
What is the purpose of each of the configured XML files (core-site.xml, hdfs-site.xml and mapred-site.xml)? In a distributed environment, which of these files should be placed on all the slave data nodes?
How do I know how many map and reduce tasks will run for any read/write activity? Will a write operation always have 0 reducers?
Apologies for asking such basic questions. Kindly suggest methods to find answers for all of the above queries.
