How to control file assignment to different slaves in a Hadoop distributed system?

How can I control which slave a file is assigned to in a Hadoop distributed system?
Is it possible to write two or more files from a MapReduce task in Hadoop simultaneously?
I am new to Hadoop, so any help would be really appreciated.

This is my answer for your #1:
You can't directly control where map tasks go in your cluster or where files get sent in your cluster. The JobTracker and the NameNode handle these, respectively. The JobTracker will try to schedule map tasks so that they are data-local, to improve performance. (I had to guess what you meant by your question; if I didn't get it right, please elaborate.)
This is my answer for your #2:
MultipleOutputs is what you are looking for when you want to write multiple files out from a single reducer.
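For illustration, here is a minimal sketch of the usual MultipleOutputs pattern with the new mapreduce API; the named outputs "small" and "large", the threshold, and the key/value types are assumptions for the example, not anything from your job:

// Driver side: the named outputs have to be registered on the Job first, e.g.
// MultipleOutputs.addNamedOutput(job, "small", TextOutputFormat.class, Text.class, IntWritable.class);
// MultipleOutputs.addNamedOutput(job, "large", TextOutputFormat.class, Text.class, IntWritable.class);

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Each record is routed to one of the two named outputs, so a single
        // reducer ends up writing two separate sets of output files.
        mos.write(sum > 100 ? "large" : "small", key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}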

Related

Can I run Spark with a segmented file on each slave node?

Imagine I have two slaves and one master. Previously I copied the same data to all slave nodes.
JavaPairRDD<IntWritable, VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class,
VectorWritable.class);
Here inputPath is not an HDFS path but a local path that each slave node has access to. But now I am considering a situation where each slave has only part of the data, and I want to use the same code without installing or working with HDFS. The problem is that after running the same code, the program finishes without any error but does not produce any result, because:
The master has no data in the inputPath.
The slaves have partial data in the inputPath, but the master did not distribute any of it to them to spread the workload.
My question is: how can I run my program in this new situation without any third-party software?
You cannot. If you want to run Spark
without installing/working with HDFS
(or other distributed storage), you have to provide a full copy of the data on each node, including the driver. Obviously this is not very useful in practice.
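For reference, a minimal sketch of what that looks like; the file:// path below is hypothetical, and the point is simply that the identical, complete sequence file must exist at that path on the driver and on every worker:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalSeqFile {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LocalSeqFile");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // file:// bypasses HDFS entirely; every node (driver included) must
        // hold the identical, complete file at this path.
        JavaPairRDD<IntWritable, Text> records =
                sc.sequenceFile("file:///data/vectors.seq", IntWritable.class, Text.class);
        System.out.println("records: " + records.count());
        sc.close();
    }
}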

Understanding file handling in hadoop

I am new to the Hadoop ecosystem and have only a basic idea of it. Please assist with the following queries to start with:
If the file I am trying to copy into HDFS is very big and cannot be accommodated by the available commodity hardware in my Hadoop cluster, what can be done? Will the file wait until space becomes available, or will there be an error?
How can I predict well in advance that the above scenario will occur in a Hadoop production environment where we continuously receive files from outside sources?
How do I add a new node to a live HDFS cluster? There are many methods, but I want to know which files I need to alter.
How many blocks does a node have? Suppose a node is a machine with storage (500 GB HDD), 1 GB of RAM and a dual-core processor. In this scenario, is it 500 GB / 64 MB, assuming each block is configured to be 64 MB?
If I copyFromLocal a 1 TB file into HDFS, which portion of the file will be placed in which block on which node? How can I know this?
How can I find out which record/row of the input file ends up in which of the multiple files Hadoop splits it into?
What is the purpose of each of the configured XML files (core-site.xml, hdfs-site.xml & mapred-site.xml)? In a distributed environment, which of these files should be placed on all the slave DataNodes?
How can I know how many map and reduce tasks will run for any read/write activity? Will a write operation always have 0 reducers?
Apologies for asking such basic questions. Kindly suggest ways to find answers to all of the above queries.

Adding new files to a running hadoop cluster

Consider that you have 10 GB of data and you want to process it with a MapReduce program using Hadoop. Instead of copying all 10 GB to HDFS at the beginning and then running the program, I want to, for example, copy 1 GB, start the job, and gradually add the remaining 9 GB over time. I wonder if this is possible in Hadoop.
Thanks,
Morteza
Unfortunately this is not possible with MapReduce. When you initiate a MapReduce job, part of the setup process is determining the block locations of your input. If the input is only partially there, the setup process will only work on those blocks and won't dynamically add inputs.
If you are looking for a stream processor, have a look at Apache Storm https://storm.apache.org/ or Apache Spark https://spark.apache.org/
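If you go the Spark route, one possible approach is Spark Streaming's textFileStream, which picks up new files as they land in a directory, so you can keep copying data in while the job runs. A minimal sketch (the input directory is hypothetical):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class IncrementalWordCount {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("IncrementalWordCount");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.minutes(1));
        // Every file that appears under this directory after the job starts is
        // processed in the next batch, so the remaining data can be copied in
        // gradually while processing is already under way.
        JavaDStream<String> lines = ssc.textFileStream("hdfs:///data/incoming");
        lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator())
             .countByValue()
             .print();
        ssc.start();
        ssc.awaitTermination();
    }
}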

Merge HDFS files without going through the network

I could do this:
hadoop fs -text /path/to/result/of/many/reducers/part* | hadoop fs -put - /path/to/concatenated/file/target.csv
But this makes the HDFS data get streamed through the network. Is there a way to tell HDFS to merge a few files on the cluster itself?
I have a problem similar to yours.
Here is an article with a number of HDFS file-merging options, but all of them have their own specifics. None of them meets my requirements. Hope this helps you.
HDFS concat (actually FileSystem.concat()). A relatively new API. Requires the original file to have its last block full. (See the sketch at the end of this answer.)
MapReduce jobs: I will probably take some solution based on this technology, but it is slow to set up.
copyMerge - as far as I can see, this will again be a copy, but I have not checked the details yet.
File crush - again, this looks like MapReduce.
So the main result is: if MapReduce setup speed suits you, there is no problem. If you have real-time requirements, things get complex.
One of my 'crazy' ideas is to use HBase coprocessor mechanics (endpoints) and file block locality information for this, as I have HBase on the same cluster. If the word 'crazy' doesn't stop you, look at this: http://blogs.apache.org/hbase/entry/coprocessor_introduction
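If the concat route fits your constraints, here is a rough sketch of FileSystem.concat(); the part-file paths are hypothetical, and the block-size preconditions mentioned above still apply:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcatParts {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path("/path/to/result/of/many/reducers/part-r-00000");
        Path[] sources = {
                new Path("/path/to/result/of/many/reducers/part-r-00001"),
                new Path("/path/to/result/of/many/reducers/part-r-00002")
        };
        // A metadata operation: the sources' blocks are appended to the target
        // file and the sources disappear, so no data is streamed through the
        // client. Only works on HDFS, subject to the preconditions above.
        fs.concat(target, sources);
    }
}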

How does CopyFromLocal command for Hadoop DFS work?

I'm a little confused about how the Hadoop Distributed File System is set up and how my particular setup affects it. I used this guide to set it up http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ using two virtual machines in VirtualBox, and have run the example (just a simple word count with a txt file as input). So far, I know that the datanode manages and retrieves the files on its node, while the tasktracker analyzes the data.
1) When you use the command -copyFromLocal, are you copying files/input to HDFS? Does Hadoop know how to divide the information between the slaves/master, and how does it do it?
2) In the configuration outlined in the guide linked above, are there technically two slaves (the master acts as both the master and a slave)? Is this common, or is the master machine usually given only JobTracker/NameNode tasks?
There are a lot of questions asked here.
Question 2)
There are two machines.
These machines are configured for HDFS and MapReduce.
The HDFS configuration requires a NameNode (master) and DataNodes (slaves).
MapReduce requires a JobTracker (master) and TaskTrackers (slaves).
Only one NameNode and one JobTracker are configured, but you can have DataNode and TaskTracker services on both machines. It is not the machine that acts as master or slave; it is just the services. You can also have slave services installed on machines that run master services. This is fine for a simple development setup. In a large-scale deployment, you dedicate the master services to separate machines.
Question 1 Part 2)
It is HDFS's job to split the file into chunks (blocks) and store them on multiple DataNodes in a replicated manner. You don't have to worry about it.
Question 1 Part 1)
Hadoop file operations are patterned after typical Unix file operations - ls, put, etc.
hadoop fs -put localfile /data/somefile --> will copy a local file to HDFS at the path /data/somefile
With the put option you can also read from standard input and write to an HDFS file.
copyFromLocal is similar to put, except that its behavior is restricted to copying from the local file system to HDFS.
See: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html#copyFromLocal
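For completeness, the same operation can be done from Java with the FileSystem API. A minimal sketch, using the paths from the example above and assuming the cluster's configuration files are on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFile {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the core-site.xml found on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        // Programmatic equivalent of "hadoop fs -copyFromLocal localfile /data/somefile".
        fs.copyFromLocalFile(new Path("localfile"), new Path("/data/somefile"));
        fs.close();
    }
}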
1)
The client connects to the name node to register a new file in HDFS.
The name node creates some metadata about the file (either using the default block size, or a configured value for the file)
For each block of data to be written, the client queries the name node for a block ID and list of destination datanodes to write the data to. Data is then written to each of the datanodes.
There is some more information in the Javadoc for org.apache.hadoop.hdfs.DFSClient.DFSOutputStream
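To make that concrete, here is a minimal sketch of the client side (the HDFS path is hypothetical); the block allocation described above all happens behind FileSystem.create():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() registers the file with the name node; as the stream fills
        // each block, the client asks the name node for a block ID and a list
        // of datanodes, then writes the bytes to those datanodes.
        try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            out.writeUTF("hello hdfs");
        }
        fs.close();
    }
}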
2) Some production systems will be configured to make the master its own dedicated node (allowing the maximum possible memory allocation, and avoiding CPU contention), but if you have a smaller cluster, then a node that hosts both a name node and a data node is acceptable.
