Storing mapreduce intermediate output on a remote server - hadoop

I use a hadoop (version 1.2.0) cluster of 16 nodes, one with a public IP (the master) and 15 connected through a private network (the slaves).
Is it possible to use a remote server (in addition to these 16 nodes) for storing the output of the mappers? The problem is that the nodes are running out of disk space during the map phase and I cannot compress map output any more.
I know that mapred.local.dirin mapred-site.xml is used to set a comma-separated list of dirs where the tmp files are stored. Ideally, I would like to have one local dir (the default one) and one directory on the remote server. When the local disk fills, then I would like to use the remote disk.

I am not very sure about about this but as per the link (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) it says that:
The local directory is a directory where MapReduce stores intermediate data files.
May be a comma-separated list of directories on different devices in
order to spread disk i/o. Directories that do not exist are ignored.
Also there are some other properties which you should check out. These might be of help:
mapreduce.tasktracker.local.dir.minspacestart: If the space in mapreduce.cluster.local.dir drops under this, do not ask for more tasks. Value in bytes
mapreduce.tasktracker.local.dir.minspacekill: If the space in mapreduce.cluster.local.dir drops under this, do not ask more tasks until all the current ones have finished and cleaned up. Also, to save the rest of the tasks we have running, kill one of them, to clean up some space. Start with the reduce tasks, then go with the ones that have finished the least. Value in bytes.

The solution was to use the iSCSI technology. A technician helped us out to achieve that, so unfortunately I am not able to provide more details on that.
We mounted the remote disk to a local path (/mnt/disk) of each slave node, and created a tmp file there, with rwx priviledges for all users.
Then, we changed the $HADOOP_HOME/conf/mapred-site.xml file and added the property:
<property>
<name>mapred.local.dir</name>
<value>/mnt/disk/tmp</value>
</property>
Initially, we had two, comma-separated values for that property, with the first being the default value, but it still didn't work as expected (we still got some "No space left on device" errors). So we left only one value there.

Related

How to proceed with a data node with corrupt disk file system

I would really appreciate help on the correct course of action. The setup is 3 ELK nodes which have all roles.
No shard replication is done. Node 3 experienced a failure on the disk which contains the data folder. An old copy (about a month) of that folder exists, and I know it would not be sufficient to copy the data in.
My question is, what is the correct course of action at this point which would return the stack to normal operation mode:
install a new disk and just launch the node? By a strike of luck, that was our least important data.
install the new disk and copy the old data and see if it can recover that data?
Also, would doing option 1, while launching an experimental node on which the data folder is mounted and restore whichever recoverable data and re-index them remotely to the original cluster?
Another option is to try to use the bin/elasticsearch-shard tool to see if you can repair part of the data.

What does it mean by 'local file system?'

I'm currently reading about hadoop and I came across this which has puzzled me (please bear in mind that I am a complete novice when it comes to hadoop) -
Use the Hadoop get command to copy a file from HDFS to your local file
system:
$ hadoop hdfs dfs -get file_name /user/login_user_name
What is a local file system? I understand that a HDFS partitions a file into different blocks throughout the cluster (but I know there's more to it than that). My understanding of the above command is that I can copy a file from the cluster to my personal (i.e. local) computer? Or is that completely wrong? I'm just not entirely sure what is meant by a local file system.
LocalFS means it may be your LinuxFS or WindowsFS. And which is not part of the DFS.
Your understanding is correct, using -get you are getting file from HDFS to Local FS and you cannot use both hadoop and hdfs. Command should be like below
hdfs dfs -get file_name local_path or hadoop fs -get file_name local_path
As per the file system logic, you can divide the filesystem into different drives. in the same way, you can create hadoop file system as a separate filesystem in linux file system.
Your local file system would be the File system over which you have installed the hadoop.your machine act as local in this case when copying file from your machine to the hadoop.
you might want to look at :HDFS vs LFS
THINK of a cluster node (server) as having to fulfill 2 needs:
the need to store its own operating system, application and user data-related files; and
the need to store its portion of sharded or "distributed" cluster data files.
In each cluster data node then there needs to be 2 independent file systems:
the LOCAL ("non-distributed") file system:
stores the OS and all OS-related ancillary ("helper") files;
stores the binary files which form the applications which run on the server;
stores additional data files, but these exist as simple files which are NOT sharded/replicated/distributed in the server's "cluster data" disks;
typically comprised of numerous partitions - entire formatted portions of a single disk or multiple disks;
typically also running LVM in order to ensure "expandability" of these partitions containing critical OS-related code which cannot be permitted to saturate or the server will suffer catastrophic (unrecoverable) failure.
AND
the DISTRIBUTED file system:
stores only the sharded, replicated portions of what are actually massive data files "distributed" across all the other data drives of all the other data nodes in the cluster
typically comprised of at least 3 identical disks, all "raw" - unformatted, with NO RAID of any kind and NO LVM of any kind, because the cluster software (installed on the "local" file system) is actually responsible for its OWN replication and fault-tolerance, so that RAID and LVM would actually be REDUNDANT, and therefore cause unwanted LATENCY in the entire cluster performance.
LOCAL <==> OS and application and data and user-related files specific or "local" to the specific server's operation itself;
DISTRIBUTED <==> sharded/replicated data; capable of being concurrently processed by all the resources in all the servers in the cluster.
A file can START in a server's LOCAL file system where it is a single little "mortal" file - unsharded, unreplicated, undistributed; if you were to delete this one copy, the file is gone gone GONE...
... but if you first MOVE that file to the cluster's DISTRIBUTED file system, where it becomes sharded, replicated and distributed across at least 3 different drives likely on 3 different servers which all participate in the cluster, so that now if a copy of this file on one of these drives were to be DELETED, the cluster itself would still contain 2 MORE copies of the same file (or shard); and WHERE in the local system your little mortal file could only be processed by the one server and its resources (CPUs + RAM)...
... once that file is moved to the CLUSTER, now it's sharded into myriad smaller pieces across at least 3 different servers (and quite possibly many many more), and that file can have its little shards all processed concurrently by ALL the resources (CPUs & RAM) of ALL the servers participating in the cluster.
And that the difference between the LOCAL file system and the DISTRIBUTED file system operating on each server, and that is the secret to the power of cluster computing :-) !...
Hope this offers a clearer picture of the difference between these two often-confusing concepts!
-Mark from North Aurora

Does hadoop use folders and subfolders

I have started learning Hadoop and just completed setting up a single node as demonstrated in hadoop 1.2.1 documentation
Now I was wondering if
When files are stored in this type of FS should I use a hierachial mode of storage - like folders and sub-folders as I do in Windows or files are just written into as long as they have a unique name?
Is it possible to add new nodes to the single node setup if say somebody were to use it in production environment. Or simply can a single node be converted to a cluster without loss of data by simply adding more nodes and editing the configuration?
This one I can google but what the hell! I am asking anyway, sue me. What is the maximum number of files I can store in HDFS?
When files are stored in this type of FS should I use a hierachial mode of storage - like folders and sub-folders as I do in Windows or files are just written into as long as they have a unique name?
Yes, use the directories to your advantage. Generally, when you run jobs in Hadoop, if you pass along a path to a directory, it will process all files in that directory. So.. you really have to use them anyway.
Is it possible to add new nodes to the single node setup if say somebody were to use it in production environment. Or simply can a single node be converted to a cluster without loss of data by simply adding more nodes and editing the configuration?
You can add/remove nodes as you please (unless by single-node, you mean pseudo-distributed... that's different)
This one I can google but what the hell! I am asking anyway, sue me. What is the maximum number of files I can store in HDFS?
Lots
To expand on climbage's answer:
Maximum number of files is a function of the amount of memory available to your Name Node server. There is some loose guidance that each metadata entry in the Name Node requires somewhere between 150-200 bytes of memory (it alters by version).
From this you'll need to extrapolate out to the number of files, and the number of blocks you have for each file (which can vary depending on file and block size) and you can estimate for a given memory allocation (2G / 4G / 20G etc), how many metadata entries (and therefore files) you can store.

HDFS How to know from which host we get a file

I use the command line and I want to know from which host I get a file (or which replica I get).
Normally it should be the nearest to me. But I changed a policy for the project. Thus I want to check the final results to see if my new policy works correctly.
Following command does not give any information:
hadoop dfs -get /file
And the next one gives me only the replica's position, but not which one is preferred for the get:
hadoop fsck /file -files -blocks -locations
HDFS abstracts this information away as it is not very useful for users to know where they are reading from (the filesystem is designed to be as less in your way as possible). Typically, the DFSClient intends to pick up the data in order of the hosts returned to it (moving onto an alternative in case of a failure). The hosts returned to it is sorted by the NameNode for appropriate data or rack locality - and that is how the default scenario works.
While the proper answer for your question would be to write good test cases that can both simulate and assert this, you can also run your program with the Hadoop logger set to DEBUG, to check the IPC connections made to various hosts (including DNs) when reading the files - and go through these to assert manually that your host-picking is working as intended.
Another way would be to run your client through a debugger and observe the parts around the connections made finally to retrieve data (i.e. after NN RPCs).
Thanks,
We finally use the networks statistics with a simple test case to find where hadoop takes the replicas.
But the easiest way is to print the array nodes modified by this method:
org.apache.hadoop.net.NetworkTopology pseudoSortByDistance( Node reader, Node[] nodes )
As we expected, the get of the replicas is based on the results of the methods. The firsts items are preferred. Normally the first item is taken except if there is an error with the node. For more information about this method, see Replication

Architecture - How to efficiently crawl the web with 10,000 machine?

Let’s pretend I have a network of 10,000 machines. I want to use all those machines to crawl the web as fast as possible. All pages should be downloaded only once. In addition there must be no single point of failure and we must minimize the number of communication required between machines. How would you accomplish this?
Is there anything more efficient than using consistent hashing to distribute the load across all machines and minimize communication between them?
Use a distributed Map Reduction system like Hadoop to divide the workspace.
If you want to be clever, or doing this in an academic context then try a Nonlinear dimension reduction.
Simplest implementation would probably be to use a hashing function on the name space key e.g. the domain name or URL. Use a Chord to assign each machine a subset of the hash values to process.
One Idea would be to use work queues (directories or DB), assuming you will be working out storage such that it meets your criteria for redundancy.
\retrieve
\retrieve\server1
\retrieve\server...
\retrieve\server10000
\in-process
\complete
1.) All pages to be seeds will be hashed and be placed in the queue using the hash as a file root.
2.) Before putting in the queue you check the complete and in-process queues to make sure you don't re-queue
3.) Each server retrieves a random batch (1-N) files from the retrieve queue and attempts to move it to the private queue
4.) Files that fail the rename process are assumed to have been “claimed” by another process
5.) Files that can be moved are to be processed put a marker in in-process directory to prevent re-queuing.
6.) Download the file and place it into the \Complete queue
7.) Clean file out of the in-process and server directories
8.) Every 1,000 runs check the oldest 10 in-process files by trying to move them from their server queues back into the general retrieve queue. This will help if a server hangs and also should load balance slow servers.
For the Retrieve, in-process and complete servers most file systems hate millions of files in 1 directory, Divide storage into segments based on the characters of the hash \abc\def\123\ would be the directory for file abcdef123FFFFFF…. If you were scaling to billions of downloads.
If you are using a mongo DB instead of a regular file store much of these problems would be avoided and you could benefit from the sharding etc…

Resources