HDFS distributed reads without Map/Reduce - hadoop

Is it possible to achieve distributed reads from HDSF cluster using an HDFS client on one machine?
I have carried out an experiment with a cluster consisting of 3 data nodes (DN1,DN2,DN3). Then I run 10 simultaneous reads from 10 independent files from a client program located on DN1, and it appeared to be only reading data from DN1. Other data nodes (DN2,DN3) were showing zero activity (judging from debug logs).
I have checked that all files' blocks are replicated across all 3 datanodes, so if I shut down DN1 then data is read from DN2 (DN2 only).
Increasing the amount of data read did not help (tried from 2GB to 30GB).
Since I have a need to read multiple large files and extract only a small amount of data from it (few Kb), I would like to avoid using map/reduce since it requires settings up more services and also requires writing the output of each split task back to HDFS. Rather it would be nice to have the result streamed directly back to my client program from the data nodes.
I am using SequenceFile for reading/writing data, in this fashion (jdk7):
//Run in thread pool on multiple files simultaneously
List<String> result = new ArrayList<>();
LongWritable key = new LongWritable();
Text value = new Text();
try(SequenceFile.Reader reader = new SequenceFile.Reader(conf,
SequenceFile.Reader.file(filePath)){
reader.next(key);
if(key.get() == ID_I_AM_LOOKING_FOR){
reader.getCurrentValue(value);
result.add(value.toString());
}
}
return result; //results from multiple workers are merged later
Any help appreciated. Thanks!

I'm afraid the behavior you see is by-design. From Hadoop document:
Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries
to satisfy a read request from a replica that is closest to the
reader. If there exists a replica on the same rack as the reader node,
then that replica is preferred to satisfy the read request. If angg/
HDFS cluster spans multiple data centers, then a replica that is
resident in the local data center is preferred over any remote
replica.
It can be further confirmed by corresponding Hadoop source code:
LocatedBlocks getBlockLocations(...) {
LocatedBlocks blocks = getBlockLocations(src, offset, length, true, true);
if (blocks != null) {
//sort the blocks
DatanodeDescriptor client = host2DataNodeMap.getDatanodeByHost(
clientMachine);
for (LocatedBlock b : blocks.getLocatedBlocks()) {
clusterMap.pseudoSortByDistance(client, b.getLocations());
// Move decommissioned datanodes to the bottom
Arrays.sort(b.getLocations(), DFSUtil.DECOM_COMPARATOR);
}
}
return blocks;
}
I.e., all available replicas are tried one after another if former one fails but the nearest one is always the first.
On the other hand, if you access HDFS files through HDFS Proxy, it does pick datanodes randomly. But I don't think that's what you want.

In addition to what Edwardw said note that your current cluster is very small (just 3 nodes) and in this case you see the files on all the nodes. This happens because the default replication factor of Hadoop is also 3. In a larger cluster your files will not be available on each node and so accessing multiple files is likely to go to different nodes and spread the load.
If you work with smaller datasets you may want to look at HBase which lets you work with smaller chunks and spread the load between nodes (by splitting regions)

I would tell that your case sounds good for MR. If we put aside particular MR computational paradigm we can tell that hadoop is built to bring code to the data, instead of opposite. Moving code to the data is essential to get scalable data processing.
In other hand - setting up MapReduce is easier then HDFS - since it store no state between jobs.
In the same time - MR framework will care about parallel processing for you - something it will take time to do properly.
Another point- if results of data processing is so small - there will be no significant performance impact if you will combine them together in reducer.
In other words - I would suggest to reconsider use of the MapReduce.

Related

How to write to specific datanode in hdfs using pyspark

I have a requirement to write common data to the same hdfs data nodes, like how we repartition in pyspark on a column to bring similar data into the same worker node, even replicas should be in the same node.
For instance, we have a file, table1.csv
Id, data
1, A
1, B
2, C
2, D
And another tablet.csv
Id, data
1, X
1, Y
2, Z
2, X1
Then datanode1 should only have (1,A),(1,B),(1,X),(1,Y)
and datanode2 should only have (2,C),(2,D),(2,Z),(2,X1)
And replication within datanodes.
It can be separate files as well based on keys. But each key should map it to a particular node.
I tried with pyspark writing to hdfs, but it just randomly assigned the datanodes when I checked with hdfs DFS fsck.
Read about rackid by setting rack topology but is there away to select which rack to store data on?
Any help is appreciated, I'm totally stuck.
KR
Alex
I maitain that without actually exposing the problem this is not going to help you but as you technically asked for a solution here's a couple ways to do what you want, but won't actually solve the underlying problem.
If you want to shift the problem to resource starvation:
Spark setting:
spark.locality.wait - technically doesn't solve your problem but is actually likely to help you immediately before you implement anything else I list here. This is should be your goto move before trying anything else as it's trivial to try.
Pro: just wait until you get a node with the data. Cheap and fast to implement.
Con: Doesn't promise to solve data locality, just will wait for a while incase the right nodes come up. It doesn't guarantee that when you run your job it will be placed on the nodes with the data.
** yarn labels**
to allocate your worker nodes to specific nodes.
Pro: This should ensure at least 1 copy of the data lands within a set of worker nodes/data nodes. If subsequent jobs also use this node label you should get good data locality. Technically it doesn't specify where data is written but by caveat yarn will write to the hdfs node it's on first.
Con: You will create congestion on these nodes, or may have to wait for other jobs to finish so you can get allocated or you may carve these into a queue that no other jobs can access reducing the functional capacity of yarn. (HDFS will still work fine)
Use Cluster federation
Ensures data lands inside a certain set of machines.
pro: A folder can be assigned to a set of data nodes
Con: You have to allocated another name node, and although this satisfies your requirement it doesn't mean you'll get data locality. Great example of something that will fit the requirement but might not solve the problem. It doesn't guarantee that when you run your job it will be placed on the nodes with the data.
My-Data-is-Everywhere
hadoop dfs -setrep -w 10000 /path of the file
Turn up replication for just the folder that contains this data equal to the number of nodes in the cluster.
Pro: All your data nodes have the data you need.
Con: You are wasting space. Not necessarily bad, but can't really be done for large amounts of data without impeding your cluster's space.
Whack-a-mole:
Turn off datanodes, until the data is replicated where you want it. Turn all other nodes back on.
Pro: You fulfill your requirement.
Con: It's very disruptive to anyone trying to use the cluster. It doesn't guarantee that when you run your job it will be placed on the nodes with the data. Again it kinda points out how silly your requirement is.
Racking-my-brain
Someone smarter than me might be able to develop a rack strategy in your cluster that would ensure data is always written to specific nodes that you could then "hope" you are allocated to... haven't fully developed the strategy in my mind but likely some math genius could work it out.
You could also implement HBASE and allocate region servers such that the data landed on the 3 servers. (As this would technically fulfill your requirement). (As it would be on 3 servers and in HDFS)

Can Hadoop tasks run in parallel on single node

I am new to hadoop and I have following questions on the same.
This is what I have understood in hadoop.
1) When ever any file is written in hadoop it is stored across all the data nodes in chunks (64MB default)
2) When we run the MR job, a split will be created from this block and on each data node the split will be processed.
3) From each split record reader will be used to generate key/value pair at mapper side.
Questions :
1) Can one data node process more than one split at a time ? What if data node capacity is more?
I think this was limitation in MR1, and with MR2 YARN we have better resource utilization.
2) Will a split be read in serial fashion at data node or can it be processed in parallel to generate key/value pair? [ By randomly accessing disk location in data node split]
3) What is 'slot' terminology in map/reduce architecture? I was reading through one of the blogs and it says YARN will provide better slot utilization in Datanode.
Let me first address the what I have understood in hadoop part.
A file stored on Hadoop file system is NOT stored across all data nodes. Yes, it is split into chunks (default is 64MB), but the number of DataNodes on which these chunks are stored depends on a.File Size b.Current Load on Data Nodes c.Replication Factor and d.Physical Proximity. The NameNode takes these factors into account when deciding which dataNodes will store the chunks of a file.
Again each Data Node MAY NOT Process a split. Firstly, DataNodes are only responsible for managing the storage of data, not executing jobs/tasks. The TaskTracker is the slave node responsible for executing tasks on individual nodes. Secondly, only those nodes which contain the data required for that particular Job will process the splits, unless the load on these nodes is too high, in which case the data in the split is copied to another node and processed there.
Now coming to the questions,
Again, dataNodes are not responsible for processing jobs/tasks. We usually refer to a combination of dataNode + taskTracker as a node since they are commonly found on the same node, handling different responsibilities (data storage & running tasks). A given node can process more than one split at a time. Usually a single split is assigned to a single Map task. This translates to multiple Map tasks running on a single node, which is possible.
Data from input file is read in serial fashion.
A node's processing capacity is defined by the number of Slots. If a node has 10 slots, it means it can process 10 tasks in parallel (these tasks may be Map/Reduce tasks). The cluster administrator usually configures the number of slots per each node considering the physical configuration of that node, such as memory, physical storage, number of processor cores, etc.

Remotely retrieve a file from hdfs and store it locally in a node

I want to write a job in which each mapper checks if a file from hdfs is stored in the node that is being executed.If this doesn't happen I want to retrieve it from hdfs and store it locally in this node.Is this possible?
EDIT: I am trying to do this (3) Preprocessing for Repartition Join, as described here: link
DistributedCache feature in Hadoop can be used to distribute the side data or auxiliary data required for the completion of the job. Here (1, 2) are some interesting articles for the same.
Why would you want to do this? The Data Locality principle used by Hadoop does this for you. Well, it does not move the data, it does move the program.
This comes from the Wikipedia page about Hadoop:
The jobtracker schedules map/reduce jobs to tasktrackers with an
awareness of the data location. An example of this would be if node A
contained data (x,y,z) and node B contained data (a,b,c). The
jobtracker will schedule node B to perform map/reduce tasks on (a,b,c)
and node A would be scheduled to perform map/reduce tasks on (x,y,z)
And the reason the computation is moved to the data and not the other way around is explained in the Hadoop documentation itself:
“Moving Computation is Cheaper than Moving Data” A computation requested by an application is much more efficient if it is executed
near the data it operates on. This is especially true when the size of
the data set is huge. This minimizes network congestion and increases
the overall throughput of the system. The assumption is that it is
often better to migrate the computation closer to where the data is
located rather than moving the data to where the application is
running. HDFS provides interfaces for applications to move themselves
closer to where the data is located.

Is the input format responsible for implementing data locality in Hadoop's MapReduce?

I am trying to understand data locality as it relates to Hadoop's Map/Reduce framework. In particular I am trying to understand what component handles data locality (i.e. is it the input format?)
Yahoo's Developer Network Page states "The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system." This seems to imply that the HDFS input format will perhaps query the name node to determine which nodes contain the desired data and will start the map tasks on those nodes if possible. One could imagine a similar approach could be taken with HBase by querying to determine which regions are serving certain records.
If a developer writes their own input format would they be responsible for implementing data locality?
You're right. If you're looking at the FileInputFormat class and the getSplits() method. It searches for the Blocklocations:
BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
This implies the FileSystem query. This happens inside the JobClient, the results getting written into a SequenceFile (actually it's just raw byte code).
So the Jobtracker reads this file later on while initializing the job and is pretty much just assigning a task to an inputsplit.
BUT the distribution of the data is the NameNodes job.
To your question now:
Normally you are extending from the FileInputFormat. So you will be forced to return a list of InputSplit, and in the initialization step it is required for such a thing to set the location of the split. For example the FileSplit:
public FileSplit(Path file, long start, long length, String[] hosts)
So actually you don't implement the data locality itself, you are just telling on which host the split can be found. This is easily queryable with the FileSystem interface.
Mu understanding is that data locality is jointly determined by HDFS and the InputFormat. The former determines (via rack awareness) and stores the location of HDFS blocks across datanodes while the latter will determine which blocks are associated with which split. The jobtracker will try to optimize which splits are delivered to which map task by making sure that the blocks associated for each split (1 split to 1 map task mapping) are local to the tasktracker.
Unfortunately, this approach to guaranteeing locality is preserved in homogeneous clusters but would break down in inhomogeneous ones i.e. ones where there are different sizes of hard disks per datanode. If you want to dig deeper on this you should read this paper (Improving MapReduce performance through data placement in heterogeneous hadoop clusters) that also touches on several topics relative to your question.

How can I be sure that data is distributed evenly across the hadoop nodes?

If I copy data from local system to HDFS, сan I be sure that it is distributed evenly across the nodes?
PS HDFS guarantee that each block will be stored at 3 different nodes. But does this mean that all blocks of my files will be sorted on same 3 nodes? Or will HDFS select them by random for each new block?
If your replication is set to 3, it will be put on 3 separate nodes. The number of nodes it's placed on is controlled by your replication factor. If you want greater distribution then you can increase the replication number by editing the $HADOOP_HOME/conf/hadoop-site.xml and changing the dfs.replication value.
I believe new blocks are placed almost randomly. There is some consideration for distribution across different racks (when hadoop is made aware of racks). There is an example (can't find link) that if you have replication at 3 and 2 racks, 2 blocks will be in one rack and the third block will be placed in the other rack. I would guess that there is no preference shown for what node gets the blocks in the rack.
I haven't seen anything indicating or stating a preference to store blocks of the same file on the same nodes.
If you are looking for ways to force balancing data across nodes (with replication at whatever value) a simple option is $HADOOP_HOME/bin/start-balancer.sh which will run a balancing process to move blocks around the cluster automatically.
This and a few other balancing options can be found in at the Hadoop FAQs
Hope that helps.
You can open HDFS Web UI on port 50070 of Your namenode. It will show you the information about data nodes. One thing you will see there - used space per node.
If you do not have UI - you can look on the space used in the HDFS directories of the data nodes.
If you have a data skew, you can run rebalancer which will solve it gradually.
Now with Hadoop-385 patch, we can choose the block placement policy, so as to place all blocks of a file in the same node (and similarly for replicated nodes). Read this blog about this topic - look at the comments section.
Yes, Hadoop distributes data per block, so each block would be distributed separately.

Resources