Can I run Spark with a segmented file on each slave node? - hadoop

Imagine I have two slaves and one master. Previously I copied the same data to every slave node.
JavaPairRDD<IntWritable, VectorWritable> seqVectors =
        sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);
Here inputPath is not an HDFS path but a local path that each slave node has access to. Now I am considering a situation where each slave has only part of the data, and I want to use the same code without installing/working with HDFS. The problem is that when I run the same code, the program finishes without any error but produces no result, because:
The master has no data in the "inputPath".
The slaves have partial data in the "inputPath", but the master did not distribute any data to them to spread the workload.
My question is: how can I run my program in this new situation, without any third-party software?

You cannot. If you want to run Spark
without installing/working with HDFS
(or other distributed storage), you have to provide a full copy of the data on each node, including the driver. Obviously that is not very useful in practice.
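For illustration, a minimal, self-contained sketch of the pattern under discussion: reading a local sequence file through an explicit file:// URI, which only behaves correctly when the identical, complete file exists at that path on the driver and on every worker. The path is hypothetical, and VectorWritable is assumed to be Mahout's, as the original snippet suggests.

import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.math.VectorWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalSequenceFileExample {
    public static void main(String[] args) {
        // Master URL and resources are expected to come from spark-submit.
        SparkConf conf = new SparkConf().setAppName("LocalSequenceFileExample");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // file:// forces the local file system; the identical file must exist
        // at this exact (hypothetical) path on the driver and on every worker.
        String inputPath = "file:///data/vectors.seq";

        JavaPairRDD<IntWritable, VectorWritable> seqVectors =
                sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);

        System.out.println("Records read: " + seqVectors.count());
        sc.stop();
    }
}

With only partial copies on each node, this constraint is violated, which is why distributed storage such as HDFS (or another shared file system) is the practical answer.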

Related

What is meant by 'local file system'?

I'm currently reading about Hadoop and I came across this, which has puzzled me (please bear in mind that I am a complete novice when it comes to Hadoop) -
Use the Hadoop get command to copy a file from HDFS to your local file
system:
$ hadoop hdfs dfs -get file_name /user/login_user_name
What is a local file system? I understand that HDFS partitions a file into different blocks throughout the cluster (but I know there's more to it than that). My understanding of the above command is that I can copy a file from the cluster to my personal (i.e. local) computer? Or is that completely wrong? I'm just not entirely sure what is meant by a local file system.
LocalFS means your local file system - for example the Linux or Windows file system - which is not part of the DFS.
Your understanding is correct: using -get you are copying a file from HDFS to the local FS. However, you cannot use both hadoop and hdfs in the same command; it should be one of the following:
hdfs dfs -get file_name local_path or hadoop fs -get file_name local_path
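For reference, the same copy can also be done programmatically through the Hadoop FileSystem API; a minimal sketch (both paths are hypothetical, and core-site.xml/hdfs-site.xml are assumed to be on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetFromHdfs {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());

        // Programmatic equivalent of: hdfs dfs -get /user/login_user_name/file_name /tmp/file_name
        fs.copyToLocalFile(new Path("/user/login_user_name/file_name"),
                           new Path("/tmp/file_name"));
        fs.close();
    }
}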
Just as a file system can be divided into different drives, you can think of the Hadoop file system as a separate file system layered on top of the Linux file system.
Your local file system is the file system on which you have installed Hadoop. Your machine acts as "local" in this case when copying a file from your machine into Hadoop.
You might want to look at: HDFS vs LFS
THINK of a cluster node (server) as having to fulfill 2 needs:
the need to store its own operating system, application and user data-related files; and
the need to store its portion of sharded or "distributed" cluster data files.
Each cluster data node therefore needs 2 independent file systems:
the LOCAL ("non-distributed") file system:
stores the OS and all OS-related ancillary ("helper") files;
stores the binary files which form the applications which run on the server;
stores additional data files, but these exist as simple files which are NOT sharded/replicated/distributed in the server's "cluster data" disks;
typically comprised of numerous partitions - entire formatted portions of a single disk or multiple disks;
typically also running LVM in order to ensure "expandability" of these partitions, which contain critical OS-related code that cannot be permitted to fill up, or the server will suffer catastrophic (unrecoverable) failure.
AND
the DISTRIBUTED file system:
stores only the sharded, replicated portions of what are actually massive data files "distributed" across all the other data drives of all the other data nodes in the cluster
typically comprised of at least 3 identical disks, all "raw" - unformatted, with NO RAID of any kind and NO LVM of any kind - because the cluster software (installed on the "local" file system) is itself responsible for its OWN replication and fault tolerance; RAID and LVM would therefore be REDUNDANT and would only add unwanted LATENCY to the entire cluster's performance.
LOCAL <==> OS and application and data and user-related files specific or "local" to the specific server's operation itself;
DISTRIBUTED <==> sharded/replicated data; capable of being concurrently processed by all the resources in all the servers in the cluster.
A file can START in a server's LOCAL file system where it is a single little "mortal" file - unsharded, unreplicated, undistributed; if you were to delete this one copy, the file is gone gone GONE...
... but if you first MOVE that file to the cluster's DISTRIBUTED file system, it becomes sharded, replicated and distributed across at least 3 different drives, likely on 3 different servers which all participate in the cluster; now, if the copy of this file on one of those drives were to be DELETED, the cluster itself would still contain 2 MORE copies of the same file (or shard); and WHEREAS in the local system your little mortal file could only be processed by the one server and its resources (CPUs + RAM)...
... once that file is moved to the CLUSTER, now it's sharded into myriad smaller pieces across at least 3 different servers (and quite possibly many many more), and that file can have its little shards all processed concurrently by ALL the resources (CPUs & RAM) of ALL the servers participating in the cluster.
And that is the difference between the LOCAL file system and the DISTRIBUTED file system operating on each server, and it is the secret to the power of cluster computing :-) !...
Hope this offers a clearer picture of the difference between these two often-confusing concepts!
-Mark from North Aurora
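To make the contrast concrete, here is a minimal sketch (hypothetical path, standard Hadoop client API) that asks HDFS for the replication factor and block locations of a file that has been moved into the distributed file system - metadata that a plain file sitting in the local file system simply does not have:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowSharding {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/big_file.csv"); // hypothetical HDFS path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());

        // Each entry is one block (shard) plus the hosts holding its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " on hosts " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}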

Understanding file handling in hadoop

I am new to the Hadoop ecosystem and have only a basic idea of it. Please assist with the following queries to start with:
If the file I am trying to copy into HDFS is very big and cannot be accommodated by the available commodity hardware in my Hadoop cluster, what can be done? Will the copy wait until space becomes available, or will there be an error?
How can I predict, well in advance, that the above scenario will occur in a Hadoop production environment where we continuously receive files from outside sources?
How do I add a new node to a live HDFS cluster? There are many methods, but I want to know which files I need to alter.
How many blocks does a node have? Assume a node has storage (HDD, 500 GB), RAM (1 GB) and a dual-core processor. In this scenario, is it something like 500 GB / 64 MB, assuming each block is configured to be 64 MB? (See the sketch after this list.)
If I copyFromLocal a 1 TB file into HDFS, which portion of the file will be placed in which block on which node? How can I know this?
How can I find which record/row of the input file ends up in which of the multiple files/splits created by Hadoop?
What is the purpose of each of the configuration XMLs (core-site.xml, hdfs-site.xml & mapred-site.xml)? In a distributed environment, which of these files should be placed on all the slave data nodes?
How can I know how many map and reduce tasks will run for any read/write activity? Will a write operation always have 0 reducers?
Apologies for asking some basic questions. Kindly suggest methods to find answers to all of the above queries.
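For the block-count arithmetic in the fourth question, a minimal sketch of the usual back-of-the-envelope calculation (all sizes are assumptions for illustration; 64 MB was the default block size in older Hadoop releases, and replication and reserved space are ignored):

public class BlockCount {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;          // assumed 64 MB block size
        long diskSize  = 500L * 1024 * 1024 * 1024;  // assumed 500 GB of data-node storage

        // Rough upper bound on blocks per node, ignoring replication and
        // the space reserved for the OS and non-HDFS data.
        long maxBlocks = diskSize / blockSize;
        System.out.println("Roughly " + maxBlocks + " blocks"); // prints 8000
    }
}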

HDFS: How to know from which host we get a file

I use the command line and I want to know from which host I get a file (or which replica I get).
Normally it should be the one nearest to me, but I changed a placement policy for the project, so I want to check the final result to see whether my new policy works correctly.
The following command does not give any information:
hadoop dfs -get /file
And the next one gives me only the replicas' locations, but not which one is preferred for the get:
hadoop fsck /file -files -blocks -locations
HDFS abstracts this information away, as it is not very useful for users to know where they are reading from (the filesystem is designed to stay out of your way as much as possible). Typically, the DFSClient picks up the data in the order of the hosts returned to it (moving on to an alternative in case of a failure). The list of hosts returned to it is sorted by the NameNode for data or rack locality - and that is how the default scenario works.
While the proper answer for your question would be to write good test cases that can both simulate and assert this, you can also run your program with the Hadoop logger set to DEBUG, to check the IPC connections made to various hosts (including DNs) when reading the files - and go through these to assert manually that your host-picking is working as intended.
Another way would be to run your client through a debugger and observe the parts around the connections made finally to retrieve data (i.e. after NN RPCs).
Thanks,
We finally used network statistics with a simple test case to find where Hadoop reads the replicas from.
But the easiest way is to print the nodes array modified by this method:
org.apache.hadoop.net.NetworkTopology pseudoSortByDistance( Node reader, Node[] nodes )
As we expected, the choice of replica for the get is based on the result of this method: the first items are preferred. Normally the first item is taken, unless there is an error with that node. For more information about this method, see Replication.
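A minimal sketch of that debugging approach - subclass NetworkTopology, delegate to the original sorting, and print the resulting node order. It assumes an older Hadoop release where the method is still named pseudoSortByDistance (newer releases renamed it to sortByDistance) and that the NameNode is patched or configured to use this class; it is a debugging aid, not a production change:

import org.apache.hadoop.net.NetworkTopology;
import org.apache.hadoop.net.Node;

// Prints the replica order chosen for each read so the placement policy
// can be verified manually.
public class LoggingTopology extends NetworkTopology {
    @Override
    public void pseudoSortByDistance(Node reader, Node[] nodes) {
        super.pseudoSortByDistance(reader, nodes); // keep the default ordering
        StringBuilder order = new StringBuilder("Read order for ");
        order.append(reader == null ? "remote client" : reader.getName()).append(':');
        for (Node node : nodes) {
            order.append(' ').append(node.getName());
        }
        System.out.println(order);
    }
}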

How does CopyFromLocal command for Hadoop DFS work?

I'm a little confused about how the Hadoop Distributed File System is set up and how my particular setup affects it. I used this guide to set it up http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ using two virtual machines on VirtualBox and have run the example (just a simple word count with a txt file as input). So far, I know that the datanode manages and retrieves the files on its node, while the tasktracker analyzes the data.
1) When you use the command -copyFromLocal, are you copying files/input to the HDFS? Does Hadoop know how to divide the information between the slaves/master, and how does it do it?
2) In the configuration outlined in the guide linked above, are there technically two slaves (the master acts as both the master and a slave)? Is this common or is the master machine usually only given jobtracker/namenode tasks?
There are a lot of questions asked here.
Question 2)
There are two machines
These machines are configured for HDFS and Map-Reduce.
HDFS configuration requires Namenode (master) and Datanodes (Slave)
Map-reduce requires Jobtracker (master) and Tasktracker (Slave)
Only one Namenode and one Jobtracker are configured, but you can have Datanode and Tasktracker services on both machines. It is not the machine that acts as master or slave; it is the services. You can have slave services installed on machines which also run master services. This is fine for a simple development setup. In a large-scale deployment, you dedicate separate machines to the master services.
Question 1 Part 2)
It is HDFS's job to split the file into chunks and store them on multiple data nodes in a replicated manner. You don't have to worry about it.
Question 1 Part 1)
Hadoop file operations are patterned after typical Unix file operations - ls, put, etc.
hadoop fs -put localfile /data/somefile --> will copy a local file to HDFS at the path /data/somefile
With the put option you can also read from standard input and write to an HDFS file.
copyFromLocal is similar to the put option, except that its behavior is restricted to copying from the local file system to HDFS.
See: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html#copyFromLocal
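For completeness, the programmatic equivalent of put/copyFromLocal via the FileSystem API; a minimal sketch with hypothetical paths and the cluster configuration assumed to be on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutIntoHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Programmatic equivalent of: hadoop fs -put /home/user/localfile /data/somefile
        fs.copyFromLocalFile(new Path("/home/user/localfile"),
                             new Path("/data/somefile"));
        fs.close();
    }
}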
1)
The client connects to the name node to register a new file in HDFS.
The name node creates some metadata about the file (either using the default block size, or a configured value for the file)
For each block of data to be written, the client queries the name node for a block ID and list of destination datanodes to write the data to. Data is then written to each of the datanodes.
There is some more information in the Javadoc for org.apache.hadoop.hdfs.DFSClient.DFSOutputStream
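From the client's point of view this whole sequence is hidden behind an ordinary output stream; a minimal sketch (hypothetical path and content) - the block allocation and the datanode pipeline described above happen inside the stream returned by create():

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() registers the new file with the name node; as the stream
        // fills a block, the client asks the name node for a block ID plus a
        // list of target datanodes and streams the bytes to them.
        try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}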
2) Some production systems will be configured to make the master its own dedicated node (allowing the maximum possible memory allocation and avoiding CPU contention), but if you have a smaller cluster, then a node which runs both a name node and a data node is acceptable.

How to control file assignment to different slaves in a Hadoop distributed system?

1) How can I control which files are assigned to which slaves in a Hadoop distributed system?
2) Is it possible to write 2 or more files simultaneously as a Hadoop map-reduce task?
I am new to Hadoop. It will be really helpful to me. If you know, please answer.
This is my answer for your #1:
You can't directly control where map tasks go in your cluster or where files get stored in your cluster. The JobTracker and the NameNode handle these, respectively. The JobTracker will try to send map tasks to nodes where their data is local, to improve performance. (I had to guess what you meant by your question; if I didn't get it right, please elaborate.)
This is my answer for your #2:
MultipleOutputs is what you are looking for when you want to write multiple files out from a single reducer.
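A minimal sketch of that pattern with the new (mapreduce) API, assuming Text keys and values and hypothetical output names; in the driver you would configure the job's output format as usual (and optionally LazyOutputFormat to suppress empty default part files):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Reducer that fans records out to several output files based on a
// hypothetical prefix of the key.
public class SplitReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // The third argument is the base name of the output file, so records
        // end up in e.g. typeA-r-00000, typeB-r-00000, ...
        String baseName = key.toString().startsWith("A") ? "typeA" : "typeB";
        for (Text value : values) {
            mos.write(key, value, baseName);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flush and close all the extra outputs
    }
}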
