How to create HDFS for a text file in Apache Spark?

This is the first time I am dealing with big data and using a cluster.
I read that the easiest way to distribute the data among the worker (slave) nodes is to use HDFS with Apache Spark.
How do I set up HDFS?

You can use Apache Spark to process files even without HDFS if you just want to experiment and learn. Take a look at the Spark Quick Start.
If you do need an HDFS cluster, your best bet is installing one of the big Hadoop distributions: Cloudera, Hortonworks, and MapR all provide easy, free installers as well as paid support services.
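For example, here is a minimal sketch using Spark's Java API that processes a plain text file straight from the local filesystem, with no HDFS involved (the file path is just a placeholder):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LocalFileCount {
        public static void main(String[] args) {
            // Run Spark locally; no cluster or HDFS is required
            SparkConf conf = new SparkConf().setAppName("LocalFileCount").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // "file://" reads from the local filesystem; once you have HDFS you would
            // point this at something like "hdfs://namenode:8020/data/input.txt"
            JavaRDD<String> lines = sc.textFile("file:///tmp/input.txt"); // hypothetical path
            System.out.println("Line count: " + lines.count());

            sc.stop();
        }
    }

Once you do have an HDFS cluster, the only change is the URI scheme of the path you pass to textFile.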

Related

data backup and recovery in hadoop 2.2.0

I am new to Hadoop and very interested in Hadoop administration, so I installed Hadoop 2.2.0 on Ubuntu 12.04 in pseudo-distributed mode and successfully ran some example JAR files. Now I am trying to learn more, starting with data backup and recovery. Can anyone explain how to back up and recover data in Hadoop 2.2.0, and also suggest good books on Hadoop administration and the steps to learn it?
Thanks in advance.
There is no classic backup and recovery functionality in Hadoop. There are several reasons for this:
HDFS uses block-level replication for data protection via redundancy.
HDFS scales out massively in size, and it is becoming more economical to back up to disk rather than tape.
The size of "Big Data" doesn't lend itself to being easily backed up.
Instead of backups, Hadoop uses data replication. Internally, it creates multiple copies of each block of data (by default, 3 copies). It also has a tool called 'distcp', which allows you to copy data between clusters. This is what most Hadoop operators typically do for "backups".
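As a small illustration of that per-file replication factor (hypothetical path; using Hadoop's Java FileSystem API), you can inspect and raise the number of copies kept for a given file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationCheck {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from core-site.xml on the classpath
            FileSystem fs = FileSystem.get(new Configuration());

            Path file = new Path("/data/important.csv"); // hypothetical file
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Current replication: " + status.getReplication()); // 3 by default

            // Ask HDFS to keep extra copies of particularly valuable data
            fs.setReplication(file, (short) 5);
        }
    }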
Some companies, like Cloudera, are building the distcp tool into a 'backup' or 'replication' service for their distribution of Hadoop. It operates against a specific directory in HDFS and replicates it to another cluster.
If you really wanted to create a backup service for Hadoop, you could build one manually yourself. You would need some mechanism of accessing the data (an NFS gateway, WebHDFS, etc.) and could then use tape libraries, VTLs, etc. to create the backups.
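If you did go down that manual route, a very rough sketch of copying a directory from one cluster to another might look like the following (hypothetical NameNode addresses and paths; using Hadoop's Java FileSystem API). It is essentially a single-threaded version of what distcp parallelizes with a MapReduce job:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class SimpleBackup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Hypothetical NameNode addresses for the source and backup clusters
            FileSystem srcFs = FileSystem.get(URI.create("hdfs://prod-namenode:8020"), conf);
            FileSystem dstFs = FileSystem.get(URI.create("hdfs://backup-namenode:8020"), conf);

            // Recursively copy /data from the production cluster to the backup cluster
            FileUtil.copy(srcFs, new Path("/data"),
                          dstFs, new Path("/backups/data"),
                          false,  // do not delete the source
                          conf);
        }
    }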

How many HBase servers should I have per Hadoop server?

I have a system that feeds small image files into an HBase table, which uses Hadoop as its file system.
I currently have 2 instances of Hadoop and 1 instance of HBase, but my question is: what should the ratio be? Should I have one Hadoop node per HBase server, or does it not really matter?
The answer is: it depends.
It depends on how much data you have, the CPU utilization of your RegionServers, and various other factors. You need to run some proofs of concept to work out the sizing of your Hadoop and HBase clusters. How you combine Hadoop and HBase varies by use case.
As a matter of fact, I have recently seen a setup where the Hadoop and HBase clusters were totally decoupled: the HBase cluster remotely used a separate Hadoop cluster for reads and writes on HDFS.
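To make that concrete: the decoupling boils down to where hbase.rootdir points. Normally you set it in hbase-site.xml, but here is a small Java sketch (hypothetical NameNode address) showing the same idea programmatically:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class DecoupledHBaseConfig {
        public static void main(String[] args) {
            // Loads hbase-default.xml and hbase-site.xml from the classpath
            Configuration conf = HBaseConfiguration.create();

            // Point HBase's root directory at a NameNode belonging to a separate Hadoop cluster
            conf.set("hbase.rootdir", "hdfs://remote-namenode:8020/hbase"); // hypothetical address
            conf.set("hbase.cluster.distributed", "true");

            System.out.println("hbase.rootdir = " + conf.get("hbase.rootdir"));
        }
    }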

Is it possible to combine mapR to pure apache hadoop?

I'm a newbie with Hadoop.
I heard that MapR is a better way to mount Hadoop's HDFS than FUSE.
But most of the related articles only describe MapR's Hadoop, not pure Apache Hadoop.
Does anyone have experience mounting pure Apache Hadoop with MapR?
Thanks in advance.
MapR is much more than just a way to mount HDFS.
MapR includes Hadoop, many Apache ecosystem components, and many other non-Apache components such as Cascading. It also includes LucidWorks, which includes Solr.
MapR also includes a reimplementation of HDFS called MaprFS. MaprFS has higher performance, has read-write semantics, allows reads during writes, supports transactionally correct mirrors and snapshots, has no NameNode, scales without the futzing of federation, is inherently HA without all the mess of the HA NameNode, and is accessible via a distributed NFS system.
Oh, and MaprFS also supports the HBase API, in addition to the HDFS API and POSIX-ish access via NFS.
The map-reduce layer in MapR has been partially re-written to make use of the extremely high performance capabilities of the file system. This is how MapR was able to break the minute sort record last fall.
So naming aside, MapR includes all the open source software that you would get with any other distribution and much more besides. "Pure Hadoop" is next to useless. You need Pig and/or Hive. You probably should look into Cascading/Scalding. You may need Mahout. You definitely will need to connect your system to legacy data sources and reporting systems which is what NFS makes easy.
Keep in mind that mounting HDFS via NFS or FUSE doesn't get you where you want to be. HDFS just doesn't have suitable semantics for access via NFS or normal file system APIs. It has too many compromises.
With MapR, on the other hand, you can even run databases like MySQL or Postgres on top of the cluster's file system via NFS.
MapR comes in three editions.
M3 is free and gives you all the performance and scalability, but limits you to a single NFS server and no mirrors, snapshots, volume locality or HBase compatible API (you can run HBase itself, of course). HA is also degraded in M3 so that it takes an hour to fail over certain functions.
M5 costs money after the free trial period and gives you snapshots, mirrors, the ability to force some data to different topologies and unlimited NFS servers.
M7 also costs money and adds the HBase API to all that M5 can do.
See mapr.com for more info.
To sum up what Ted said as well:
You're not really "mounting pure Apache Hadoop with MapR". Hadoop shouldn't be confused with HDFS. While the two tend to be used interchangeably in conversation, HDFS explicitly refers to the actual distributed filesystem (hence the DFS in HDFS). HDFS has to be interacted with using Hadoop-specific commands; e.g. "hadoop dfs -ls /" will list the root contents of HDFS.
MapR went above and beyond what Hadoop provides you by default. For one, you can interact with the filesystem using the more efficient MapRFS (a rewrite of HDFS). The other thing you can do is actually NFS-mount HDFS/MapRFS so that you can manipulate the filesystem natively without having to do anything special. It gets treated like any other NFS filesystem, except that in this case it is distributed across your cluster.
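To make that distinction concrete, here is a minimal sketch of the programmatic equivalent of "hadoop dfs -ls /", using Hadoop's Java FileSystem API (the NameNode address is a placeholder). This is the kind of Hadoop-specific access that an NFS mount lets you skip:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHdfsRoot {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS normally comes from core-site.xml; hardcoded here for illustration
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical address

            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }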

What are sites for Hadoop Best practices

What are some sites for Hadoop best practices (not books) where I can get a step-by-step process for creating new projects and small examples? I am not able to find a single site like this; please share.
There is an awesome article from Yahoo developers on this: Apache Hadoop: Best Practices and Anti-Patterns.
Hadoop is not a single application; rather, it is a distributed processing framework used by several applications that sit on top of it. Pig, Hive, HBase, Cassandra, etc. are a few of the many such applications, each designed for a specific requirement. Underneath, all of these applications consume the Hadoop framework, which mainly consists of a distributed file system (HDFS) and distributed processing (MapReduce).
Technically, once you have a bare-minimum Hadoop cluster (HDFS + MapReduce only), you can start writing MapReduce-based applications (in Java, or in other languages supported through Hadoop Streaming) to process some data.
What you could do first is download a pre-built/pre-configured Hadoop virtual machine image from the Cloudera or Hortonworks distribution and get it running on your machine. After that, start learning to write MapReduce jobs in Java and run them in your virtual machine.
Here is the URL to download the Cloudera Hadoop distribution VM.
Here is the link for learning to write the simplest wordcount job.
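For reference, here is a minimal sketch of that classic wordcount job using the standard Hadoop MapReduce Java API (input and output paths are passed as arguments; nothing here is specific to any particular distribution):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Emits (word, 1) for every token in each input line
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Sums the counts emitted for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/me/input
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You would package this as a JAR and run it with "hadoop jar wordcount.jar WordCount /input /output".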

HBase and Hadoop

HBase requires a Hadoop installation, based on what I have read so far. It looks like HBase can be set up to use an existing Hadoop cluster (shared with some other users) or a dedicated Hadoop cluster. I guess the latter would be the safer configuration, but I am wondering if anybody has experience with the former (though I am not very sure my understanding of the HBase setup is correct).
I know that Facebook and other large organizations separate their HBase cluster (real time access) from their Hadoop cluster (batch analytics) for performance reasons. Large MapReduce jobs on the cluster have the ability to impact performance of the real-time interface, which can be problematic.
In a smaller organization or in a situation in which your HBase response time doesn't necessarily need to be consistent, you can just use the same cluster.
There aren't many (or any) concerns with coexistence other than performance concerns.
We've set it up with an existing Hadoop cluster that's 1,000 cores strong. Short answer: it works just fine, at least with Cloudera CH2 +149.88. But your mileage may vary by Hadoop version.
In distributed mode, Hadoop is used for its HDFS storage. HBase stores its HFiles on HDFS and thus benefits from the replication strategies and data-locality principles provided by the DataNodes.
RegionServers are basically meant to handle local data, but they might still have to fetch data from other DataNodes.
Hope that helps you understand why and how Hadoop is used with HBase.
