I am new to Hadoop and very interested in Hadoop administration. I installed Hadoop 2.2.0 on Ubuntu 12.04 in pseudo-distributed mode and successfully ran some of the example jar files. Now I am trying to learn about data backup and recovery. Can anyone tell me how to back up and recover data in Hadoop 2.2.0? Please also suggest any good books on Hadoop administration and steps for learning it.
Thanks in advance.
There is no classic backup and recovery functionality in Hadoop. There are several reasons for this:
HDFS uses block level replication for data protection via redundancy.
HDFS scales out massively in size, and it is becoming more economical to back up to disk rather than tape.
The size of "Big Data" doesn't lend itself to being easily backed up.
Instead of backups, Hadoop uses data replication. Internally, it creates multiple copies of each block of data (by default, 3 copies). It also has a function called 'distcp', which allows you to replicate copies of data between clusters. This is what's typically done for "backups" by most Hadoop operators.
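As a rough illustration of the distcp approach described above, here is a minimal Python sketch that only assembles the command line for a cluster-to-cluster copy. The host names, ports and paths are hypothetical placeholders, and the helper name build_distcp_command is made up for this example; the sketch assumes the usual `hadoop distcp <source> <target>` form with the optional `-update` flag.

```python
import shlex


def build_distcp_command(source_uri, target_uri, update=False):
    """Return the argv list for a `hadoop distcp` copy between two clusters."""
    if not source_uri or not target_uri:
        raise ValueError("both source and target URIs are required")
    argv = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ at the target.
        argv.append("-update")
    argv += [source_uri, target_uri]
    return argv


# Hypothetical NameNode hosts and paths, for illustration only.
command = build_distcp_command(
    "hdfs://nn1.example.com:8020/data/logs",
    "hdfs://nn2.example.com:8020/backup/logs",
    update=True,
)
# Print the command as a shell-safe string instead of executing it.
print(" ".join(shlex.quote(part) for part in command))
```

On a machine that actually has the Hadoop client installed, the same list could be handed to subprocess.run(command, check=True), for example from a scheduled job.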
Some companies, like Cloudera, are incorporating the distcp tool into creating a 'backup' or 'replication' service for their distribution of Hadoop. It operates against a specific directory in HDFS, and replicates it to another cluster.
If you really wanted to create a backup service for Hadoop, you could build one yourself. You would need some mechanism for accessing the data (NFS gateway, webFS, etc.), and could then use tape libraries, VTLs, etc. to create backups.
I am new to HBase. I recently installed HBase and tried to start it on my Mac. Everything works and I can play with HBase. Some articles say I should start Hadoop first when using HBase. Has this prerequisite changed?
Hadoop is not a hard requirement for HBase unless you are running in fully distributed mode, which you are not. On a single node, as in your case, you can use the local filesystem. See HBase run modes: Standalone and Distributed for more information.
Your local filesystem (the file:// URI) is Hadoop-compatible. HBase requires a Hadoop-compatible storage layer, but that does not mean it must literally be HDFS.
HDFS simply adds scalability and reliability.
I've seen that Red Hat has come up with a possible solution in which GlusterFS works as the backend for Hadoop. In this case, you can get rid of the NameNode/DataNode architecture and replace it with GlusterFS, while still keeping Hadoop MapReduce API compatibility.
I'm wondering how the performance compares against native HDFS. Is it really production ready? Does it support the whole Hadoop ecosystem as well, e.g. Solr Cloud, Spark, Impala, etc.?
Disclaimer: I work for a storage vendor.
Well, I don't know much about GlusterFS in particular, but I can speak about Lustre, as it is also POSIX at the end of the day. It's a parallel filesystem, and the benchmarks I looked into recently showed that it outperforms HDFS. It is definitely a production-ready alternative that offers a single namespace for your data (no more HDFS ingestion).
What does work from the Hadoop ecosystem today?
What I've seen in production today is Spark, Hive and HBase. Impala appears to require certain parts of HDFS, which is why it doesn't work with a POSIX FS, and it's not HCFS. I did a quick test and was able to create the database and everything, but I wasn't able to fetch any rows.
Let me know if you need further help.
My understanding was that Spark is an alternative to Hadoop. However, when trying to install Spark, the installation page asks for an existing Hadoop installation. I'm not able to find anything that clarifies that relationship.
Secondly, Spark apparently has good connectivity to Cassandra and Hive. Both have SQL-style interfaces. However, Spark has its own SQL. Why would one use Cassandra/Hive instead of Spark's native SQL, assuming that this is a brand new project with no existing installation?
Spark is a distributed in memory processing engine. It does not need to be paired with Hadoop, but since Hadoop is one of the most popular big data processing tools, Spark is designed to work well in that environment. For example, Hadoop uses the HDFS (Hadoop Distributed File System) to store its data, so Spark is able to read data from HDFS, and to save results in HDFS.
For speed, Spark keeps its data sets in memory. It will typically start a job by loading data from durable storage, such as HDFS, HBase, a Cassandra database, etc. Once loaded into memory, Spark can run many transformations on the data set to calculate a desired result. The final result is then typically written back to durable storage.
In terms of it being an alternative to Hadoop, it can be much faster than Hadoop at certain operations. For example a multi-pass map reduce operation can be dramatically faster in Spark than with Hadoop map reduce since most of the disk I/O of Hadoop is avoided. Spark can read data formatted for Apache Hive, so Spark SQL can be much faster than using HQL (Hive Query Language).
Cassandra has its own native query language called CQL (Cassandra Query Language), but it is a small subset of full SQL and is quite poor for things like aggregation and ad hoc queries. So when Spark is paired with Cassandra, it offers a more feature rich query language and allows you to do data analytics that native CQL doesn't provide.
Another use case for Spark is for stream processing. Spark can be set up to ingest incoming real time data and process it in micro-batches, and then save the result to durable storage, such as HDFS, Cassandra, etc.
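To make the micro-batch idea concrete, here is a small plain-Python sketch. It does not use the Spark API; it only mimics the pattern of cutting an incoming stream into small batches, running a job on each batch and appending the result to a store. Batching by a fixed count (rather than by a time interval), the function names and the in-memory list standing in for durable storage are all simplifying assumptions.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield consecutive lists of up to batch_size events from an iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_stream(events, batch_size, sink):
    """Run a small job on each micro-batch and append its result to sink."""
    for batch in micro_batches(events, batch_size):
        sink.append(sum(batch))  # the per-batch "job": here, a simple sum


durable_store = []  # stands in for HDFS, Cassandra, etc.
process_stream(range(1, 8), batch_size=3, sink=durable_store)
print(durable_store)  # [6, 15, 7]
```

Each entry in durable_store is the result for one micro-batch, which is why the last entry covers only the single leftover event.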
So Spark is really a standalone in-memory system that can be paired with many different distributed databases and file systems to add performance, a more complete SQL implementation, and features they may lack, such as stream processing.
I'm writing a paper about Hadoop for university and stumbled over your question. Spark uses Hadoop only for persistence, and only if you want it to. It's possible to use it with other persistence tiers, such as Amazon S3.
On the other hand, Spark runs in memory and is not primarily built for map-reduce use cases the way Hadoop was/is.
I can recommend this article, if you like a more detailed description: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
The README.md file in Spark can solve your puzzle:
A Note About Hadoop Versions
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
Please refer to the build documentation at
"Specifying the Hadoop Version"
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.
I'm a newbie to Hadoop.
I heard that MapR is a better way to mount Hadoop HDFS than FUSE.
But most of the related articles describe only MapR's Hadoop, not pure Apache Hadoop.
Does anyone have experience mounting pure Apache Hadoop with MapR?
Thanks in advance.
MapR is much more than just a way to mount HDFS.
MapR includes Hadoop, many Apache ecosystem components and many non-Apache components such as Cascading. It also includes LucidWorks, which includes Solr.
MapR also includes a reimplementation of HDFS called MaprFS. MaprFS has higher performance, has read-write semantics, allows read during write, supports transactionally correct mirrors and snapshots, has no name node, scales without the futzing of federation, is inherently HA without all the mess of the HA NameNode, and is accessible via a distributed NFS system.
Oh, MaprFS also supports the HBase API in addition to POSIX-ish access via NFS and in addition to the HDFS API.
The map-reduce layer in MapR has been partially re-written to make use of the extremely high performance capabilities of the file system. This is how MapR was able to break the minute sort record last fall.
So naming aside, MapR includes all the open source software that you would get with any other distribution and much more besides. "Pure Hadoop" is next to useless. You need Pig and/or Hive. You probably should look into Cascading/Scalding. You may need Mahout. You definitely will need to connect your system to legacy data sources and reporting systems which is what NFS makes easy.
Keep in mind that mounting HDFS via NFS or FUSE doesn't get you where you want to be. HDFS just doesn't have suitable semantics for access via NFS or normal file system APIs. It just has too many compromises.
With MapR, on the other hand, you can even run databases like MySQL or Postgres on top of the cluster's file system via NFS.
MapR comes in three editions.
M3 is free and gives you all the performance and scalability, but limits you to a single NFS server and no mirrors, snapshots, volume locality or HBase compatible API (you can run HBase itself, of course). HA is also degraded in M3 so that it takes an hour to fail over certain functions.
M5 costs money after the free trial period and gives you snapshots, mirrors, the ability to force some data to different topologies and unlimited NFS servers.
M7 also costs money and adds the HBase API to all that M5 can do.
See mapr.com for more info.
To sum up what Ted said as well,
You're not really "mounting pure Apache Hadoop with MapR". Hadoop shouldn't be confused with HDFS. While the two terms tend to be used interchangeably in conversation, HDFS refers specifically to the distributed filesystem (hence the DFS in HDFS). HDFS has to be accessed with specific hadoop commands, e.g. "hadoop fs -ls /" lists the root contents of HDFS.
MapR went above and beyond what Hadoop provides by default. First, you can interact with the filesystem using the more efficient MapRFS (a rewrite of HDFS). Second, you can NFS-mount the HDFS/MapRFS so that you can manipulate the filesystem natively without having to do anything special. It is treated like any other NFS filesystem, except that in this case it's distributed across your cluster.
I am currently "playing around" with Hadoop in a VM (the CDH4.1.3 image from Cloudera). What I am wondering about is the following (the documentation did not really help me in this regard).
Following the tutorial, I would format a NameNode first. OK, that is already done if one uses the Cloudera image. Likewise, the HDFS file structure is already present. In hdfs-site.xml the DataNode data dir is set to:
/var/lib/hadoop-hdfs/cache/${user.name}/dfs/data
which is obviously where the blocks are supposed to be copied to in a real distributed setting. The Cloudera tutorial tells you to create HDFS "home directories" for each user (/users/<username>), and I do not understand what they are for. Are they just for local test runs in a single-node setup?
Say I really had petabytes of data on tape that did not fit into my local storage. This data would have to be distributed straight away, rendering a local "home directory" entirely useless.
Could someone tell me, just to give me an intuition, how a real Hadoop workflow with massive data would look like? What kind of distinct nodes would I have running for a start?
There's the master (JobTracker) with its slaves file (where would I put that?), which allows the master to resolve all the DataNodes. Then there is my NameNode, which keeps track of where the block IDs are stored. The DataNodes also carry TaskTracker responsibility. In the config files, the NameNode's URI is included. Am I correct so far? Then there is still the ${user.name} variable in the configuration which, if I understood it right, has something to do with WebHDFS; it would be great if someone could explain that to me as well. In the running examples, the directories tend to be hardcoded to
/var/lib/hadoop-hdfs/cache/1/dfs/data, /var/lib/hadoop-hdfs/cache/2/dfs/data and so on.
So, back to the example: Say, I have my tape and want to import data into my HDFS (and I am required to stream data into the filesystem because I lack the local storage to save it locally on a single machine). Where would I start with the migration process? On an arbitrary DataNode? On the NameNode that distributes the chunks? After all, I cannot assume the data just to "be there", because the name node has to be aware of the block IDs.
It would be great if someone could briefly elaborate on these topics:
What is the home directory really for?
Do I migrate data to the home directory first and to the real distributed system afterwards?
How does WebHDFS work and what role does it play with regard to the user.name variable?
How would I migrate "big data" into my HDFS on the fly, or even if it's not big data, how do I populate my file system in a proper way (meaning that the chunks get randomly distributed across the cluster)?
What is the home directory really for?
You have a small confusion here. Just as /home exists on local Linux filesystems, where users are given their own storage space, /users is a home mount ON the HDFS (the distributed FS). The tutorial needs you to administratively create a home directory for the user you later wish to run data loads and queries as, so that this user gets adequate permissions and storage access on the HDFS. The tutorial is not asking you to create these directories locally.
Do I migrate data to the home directory first and to the real distributed system afterwards?
I believe my above answer should clarify this for you. You should create your home directory on the HDFS, and then load all your data inside of that directory.
How does WebHDFS work and what role does it play with regard to the user.name variable?
WebHDFS is one of the various ways to access HDFS. Regular clients that talk to HDFS have to use the Java APIs. WebHDFS (and also HttpFs) was added to HDFS to let other languages have their own sets of APIs by providing a REST front-end to HDFS. WebHDFS allows user authentication, which helps preserve the permission and security models.
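As a small illustration of the REST front-end, the sketch below builds a WebHDFS request URL in Python. It assumes the standard /webhdfs/v1 path prefix and the user.name query parameter that WebHDFS accepts when security is off; the host name, the port 50070 and the user "alice" are hypothetical values, and webhdfs_url is a helper invented for this example.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, hdfs_path, op, user=None):
    """Build the URL for a WebHDFS REST operation on an absolute HDFS path."""
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute")
    params = {"op": op}
    if user is not None:
        # With security off, WebHDFS takes the caller's identity from user.name.
        params["user.name"] = user
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        host, port, quote(hdfs_path), urlencode(params)
    )


# Hypothetical NameNode address and user, for illustration only.
url = webhdfs_url("namenode.example.com", 50070, "/user/alice",
                  "LISTSTATUS", user="alice")
print(url)
```

Any HTTP client, in any language, could then issue a GET request against such a URL to list the directory.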
How would I migrate "big data" into my HDFS on the fly, or even if it's not big data, how do I populate my file system in a proper way (meaning that the chunks get randomly distributed across the cluster)?
A large part of the problem HDFS solves for you is managing the distribution of data. When you load files or data streams into HDFS (via CLI tools, sinks from Apache Flume, etc.), HDFS itself spreads the blocks in an ideal distribution and manages the chunking as well. All you need to do is use the regular user-side FileSystem-style APIs and forget about what goes where underneath; it's all managed for you.
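To give an intuition for what "managed for you" means, here is a toy Python sketch of chunking a file into blocks and assigning each block to several DataNodes. It is deliberately simplified: the round-robin placement is an assumption made for illustration and is not HDFS's real placement policy, and the 128 MiB block size, the replication factor of 3 and the node names are example values.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + offset) % len(datanodes)]
            for offset in range(replication)
        ]
    return placement


MIB = 1024 * 1024
blocks = split_into_blocks(300 * MIB, 128 * MIB)  # two full blocks + 44 MiB
nodes = ["dn1", "dn2", "dn3", "dn4"]              # hypothetical DataNodes
layout = place_replicas(len(blocks), nodes)
for block_id, holders in layout.items():
    print(block_id, blocks[block_id] // MIB, "MiB ->", holders)
```

The client never performs this bookkeeping itself; the sketch only shows the kind of bookkeeping that happens underneath the FileSystem API.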