HDFS configuration & what is the user directory for? - hadoop

I am currently "playing around" with Hadoop in a VM (CDH4.1.3 image from cloudera). What I am wondering about is the following (and the documentation did not really help me in that regard).
Following the tutorial, I would format a NameNode first - OK, that is already done if one uses the cloudera image. Likewise the HDFS file structure is already present. In the hdfs-site.xml the datanode data dir is set to:
/var/lib/hadoop-hdfs/cache/${user.name}/dfs/data
which is obviously where the blocks are supposed to be copied to in a real distributed setting. In the cloudera tutorial, one is told to create hdfs "home directories" for each user (/users/<username>), which I do not understand what they are for. Are they just for local test-runs in a single-node setup?
Say I really had petabytes of data on type not fitting into my local storage. This data would have to be distributed straight away, rendering a local "home directory" entirely useless.
Could someone tell me, just to give me an intuition, how a real Hadoop workflow with massive data would look like? What kind of distinct nodes would I have running for a start?
There's the master (JobTracker) with its slave file (where would I put that) allowing the master to resolve all the DataNodes. Then there is my NameNode that keeps track of where the block IDs are stored. The DataNodes are also carry TaskTracker responsibility. In the config files, the NameNode's URI is included -- am I correct so far? Then there is still the ${user.name} variable in the configuration which apparently, if I understood it right, has something to do with WebHDFS, which would also be great if someone could explain to me. In the running examples, the directions tend to be hardcoded to
/var/lib/hadoop-hdfs/cache/1/dfs/data, /var/lib/hadoop-hdfs/cache/2/dfs/data and so on.
So, back to the example: Say, I have my tape and want to import data into my HDFS (and I am required to stream data into the filesystem because I lack the local storage to save it locally on a single machine). Where would I start with the migration process? On an arbitrary DataNode? On the NameNode that distributes the chunks? After all, I cannot assume the data just to "be there", because the name node has to be aware of the block IDs.
It would be great if someone could shortly elaborate on these topics:
What is the home directory really for?
Do I migrate data to the home directory first and to the real distributed system afterwards?
How does WebHDFS work and what role does it play with regards to the user.name variable
How would I migrate "big data" into my HDFS on the fly - or even if it's not big data, how do I populate my file system in a proper way (meaning, that the chunks get randomly distributed across the cluster?

What is the home directory really for?
You have a small confusion here. Just like /home exists for local filesystems on Linux, where users are given their own storage space, /users is a home mount ON the HDFS (Distributed FS). The tutorial needs you to administratively create a home directory for the user you wish to later be running data loads and queries as, such that they get adequate permissions and storage access onto the HDFS. The tutorial is not asking you to create these directories locally.
Do I migrate data to the home directory first and to the real distributed system afterwards?
I believe my above answer should clarify this for you. You should create your home directory on the HDFS, and then load all your data inside of that directory.
How does WebHDFS work and what role does it play with regards to the user.name variable
WebHDFS is one of the various ways to access HDFS. Regular clients to talk to HDFS require use of Java APIs. WebHDFS (and also HttpFs) techniques were added to HDFS to let other languages have their own set of APIs by providing a REST front-end to HDFS. WebHDFS allows user-authentication, to help persist the permission and security models.
How would I migrate "big data" into my HDFS on the fly - or even if it's not big data, how do I populate my file system in a proper way (meaning, that the chunks get randomly distributed across the cluster?
The large part of problem HDFS solves for you is that of managing distribution of data. When loading files or data streams to HDFS (via CLI tools, sinks from Apache Flume, etc.), the blocks are spread in an ideal distribution by HDFS itself, and the chunking is managed by it as well. All you need to do is use the user-side regular FileSystem style APIs and forget about what goes where underneath - its all managed for you.

Related

What is the recommended DefaultFS (File system) for Hadoop on ephemeral Dataproc clusters?

What is the recommended DefaultFS (File system) for Hadoop on Dataproc. Are there any benchmarks, considerations available around using GCS vs HDFS as the default file system?
I was also trying to test things out and discovered that when I set the DefaultFS to a gs:// path, the Hive scratch files are getting created - both on HDFS as well as the GCS paths. Is this happening synchronously and adding to latency or does the write to GCS happen after the fact?
Would appreciate any guidance, reference around this.
Thank you
PS: These are ephemeral Dataproc clusters that are going to be using GCS for all persistent data.
HDFS is faster. There should already be public benchmarks for that, or just taken as a fact because GCS is networked storage where HDFS is directly mounted in the Dataproc VMs.
"Recommended" would be persistent storage, though, so GCS, but maybe only after finalizing the data in the applications. For example, you might not want Hive scratch files in GCS since they'll never be used outside of the current query session, but you would want Spark checkpoints if you're running periodic batch jobs that scale down the HDFS cluster in between executions
I would say the default (HDFS) is the recommended. Typically, the input and output data of Dataproc jobs are persisted outside of the cluster in GCS or BigQuery, the cluster is used for compute and intermediate data. These intermediate data are stored on local disks directly or through HDFS which eventually also goes to local disks. After the job is done, you can safely delete the cluster, only pay for the storage of input and output data to save cost.
Also HDFS usually has lower latency for intermediate data, especially for lots of small files and metadata operations, e.g. dir rename. GCS is better at throughput for large files.
But when using HDFS, you need to provision sufficient disk space (at least 1TB each node) and consider using local SSDs. See https://cloud.google.com/dataproc/docs/support/spark-job-tuning#optimize_disk_size for more details.

How to archive data stored in HDFS files on another (non-distributed) server?

I have a project folder containing approx. 50 GB of parquet files on a hadoop cluster (CDH 5.14), which I need to archive and move to another host (non-distributed with Windows or Linux). This is only a one time job - I do not plan to bring the data back to HDFS any time soon, however there should be a way to deploy it back to a distributed file system. What would be the optimal way to do it? Unfortunately, I don't have another hadoop cluster or a cloud environment where I could place this data.
I would appreciate any hints.
The optimal solution can depend on the actual data (e.g. Tables, many/few flat files). If you know how they got in there, looking at the inverse could be a logical first step.
For example, if you just use put to place the files, consider using get.
If you use Nifi to get it in, try Nifi to get it out.
After the data is on your Linux box, you can use SCP or something like FTP or a mounted drive to move it to the desired computer.

data backup and recovery in hadoop 2.2.0

I am new to Hadoop and much interested in Hadoop Administration,so i tried to install Hadoop 2.2.0 in Ubuntu 12.04 as pseudo distributed mode and installed successfully and run some example jar files also ,now i am trying learn further ,trying to learn data back up and recovery part now,can anyone tell ways to take data back back up and recovery it in hadoop 2.2.0 ,and also please suggest any good books for Hadoop Adminstration and steps to learn Hadoop Adminstration.
Thanks in Advance.
There is no classic backup and recovery functionality in Hadoop. There are several reasons for this:
HDFS uses block level replication for data protection via redundancy.
HDFS scales out massively in size, and it is becoming more economic to backup to disk, rather than tape.
The size of "Big Data" doesn't lend itself to being easily backed up.
Instead of backups, Hadoop uses data replication. Internally, it creates multiple copies of each block of data (by default, 3 copies). It also has a function called 'distcp', which allows you to replicate copies of data between clusters. This is what's typically done for "backups" by most Hadoop operators.
Some companies, like Cloudera, are incorporating the distcp tool into creating a 'backup' or 'replication' service for their distribution of Hadoop. It operates against a specific directory in HDFS, and replicates it to another cluster.
If you really wanted to create a backup service for Hadoop, you can create one manually yourself. You would need some mechanism of accessing the data (NFS gateway, webFS, etc), and could then use tape libraries, VTLs, etc. to create backups.

Writing to local file during map phase in hadoop

Hadoop writes the intermediate results to the local disk and the results of the reducer to the HDFS. what does HDFS mean. What does it physically translate to
HDFS is the Hadoop Distributed File System. Physically, it is a program running on each node of the cluster that provides a file system interface very similar to that of a local file system. However, data written to HDFS is not just stored on the local disk but rather is distributed on disks across the cluster. Data stored in HDFS is typically also replicated, so the same block of data may appear on multiple nodes in the cluster. This provides reliable access so that one node's crashing or being busy will not prevent someone from being able to read any particular block of data from HDFS.
Check out http://en.wikipedia.org/wiki/Hadoop_Distributed_File_System#Hadoop_Distributed_File_System for more information.
As Chase indicated, HDFS is Hadoop Distributed File System.
If I may, I recommend this tutorial and video of how HDFS and the Map/Reduce framework works and will serve you as a guide into the world of Hadoop: http://www.cloudera.com/resource/introduction-to-apache-mapreduce-and-hdfs/

Writing data to Hadoop

I need to write data in to Hadoop (HDFS) from external sources like a windows box. Right now I have been copying the data onto the namenode and using HDFS's put command to ingest it into the cluster. In my browsing of the code I didn't see an API for doing this. I am hoping someone can show me that I am wrong and there is an easy way to code external clients against HDFS.
There is an API in Java. You can use it by including the Hadoop code in your project.
The JavaDoc is quite helpful in general, but of course you have to know, what you are looking for *g *
http://hadoop.apache.org/common/docs/
For your particular problem, have a look at:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html
(this applies to the latest release, consult other JavaDocs for different versions!)
A typical call would be:
Filesystem.get(new JobConf()).create(new Path("however.file"));
Which returns you a stream you can handle with regular JavaIO.
For the problem of loading the data I needed to put into HDFS, I choose to turn the problem around.
Instead of uploading the files to HDFS from the server where they resided, I wrote a Java Map/Reduce job where the mapper read the file from the file server (in this case via https), then write it directly to HDFS (via the Java API).
The list of files is read from the input. I then have an external script that populates a file with the list of files to fetch, uploads the file into HDFS (using hadoop dfs -put), then start the map/reduce job with a decent number of mappers.
This gives me excellent transfer performance, since multiple files are read/written at the same time.
Maybe not the answer you were looking for, but hopefully helpful anyway :-).
About 2 years after my last answer, there are now two new alternatives - Hoop/HttpFS, and WebHDFS.
Regarding Hoop, it was first announced in Cloudera's blog and can be downloaded from a github repository. I have managed to get this version to talk successfully to at least Hadoop 0.20.1, it can probably talk to slightly older versions as well.
If you're running Hadoop 0.23.1 which at time of writing still is not released, Hoop is instead part of Hadoop as its own component, the HttpFS. This work was done as part of HDFS-2178. Hoop/HttpFS can be a proxy not only to HDFS, but also to other Hadoop-compatible filesystems such as Amazon S3.
Hoop/HttpFS runs as its own standalone service.
There's also WebHDFS which runs as part of the NameNode and DataNode services. It also provides a REST API which, if I understand correctly, is compatible with the HttpFS API. WebHDFS is part of Hadoop 1.0 and one of its major features is that it provides data locality - when you're making a read request, you will be redirected to the WebHDFS component on the datanode where the data resides.
Which component to choose depends a bit on your current setup and what needs you have. If you need a HTTP REST interface to HDFS now and you're running a version that does not include WebHDFS, starting with Hoop from the github repository seems like the easiest option. If you are running a version that includes WebHDFS, I would go for that unless you need some of the features Hoop has that WebHDFS lacks (access to other filesystems, bandwidth limitation, etc.)
Install Cygwin, install Hadoop locally (you just need the binary and configs that point at your NN -- no need to actually run the services), run hadoop fs -copyFromLocal /path/to/localfile /hdfs/path/
You can also use the new Cloudera desktop to upload a file via the web UI, though that might not be a good option for giant files.
There's also a WebDAV overlay for HDFS but I don't know how stable/reliable that is.
It seems there is a dedicated page now for this at http://wiki.apache.org/hadoop/MountableHDFS:
These projects (enumerated below) allow HDFS to be mounted (on most
flavors of Unix) as a standard file system using the mount command.
Once mounted, the user can operate on an instance of hdfs using
standard Unix utilities such as 'ls', 'cd', 'cp', 'mkdir', 'find',
'grep', or use standard Posix libraries like open, write, read, close
from C, C++, Python, Ruby, Perl, Java, bash, etc.
Later it describes these projects
contrib/fuse-dfs is built on fuse, some C glue, libhdfs and the hadoop-dev.jar
fuse-j-hdfs is built on fuse, fuse for java, and the hadoop-dev.jar
hdfs-fuse - a google code project is very similar to contrib/fuse-dfs
webdav - hdfs exposed as a webdav resource mapR - contains a closed source hdfs compatible file system that supports read/write
NFS access
HDFS NFS Proxy - exports HDFS as NFS without use of fuse. Supports Kerberos and re-orders writes so they are written to hdfs
sequentially.
I haven't tried any of these, but I will update the answer soon as I have the same need as the OP
You can now also try to use Talend, which includes components for Hadoop integration.
you can try mounting HDFS on your machine(call it machine_X) where you are executing your code and machine_X should have infiniband connectivity with the HDFS Check this out, https://wiki.apache.org/hadoop/MountableHDFS
You can also use HadoopDrive (http://hadoopdrive.effisoft.eu). It's a Windows shell extension.

Resources