GlusterFS as the backend for Hadoop - hadoop

I've seen that Red Hat has come up with a possible solution in which GlusterFS works as the backend for Hadoop. In this case, you can get rid of the NameNode/DataNode architecture and replace it with GlusterFS, while still keeping Hadoop MapReduce API compatibility.
Just wondering how the performance compares against native HDFS? Is it really production ready? Does it support the whole Hadoop ecosystem as well, e.g. Solr Cloud, Spark, Impala, etc.?

Disclaimer: I work for a storage vendor.
I don't know much about GlusterFS in particular, but I can speak about Lustre, as it's POSIX at the end of the day. Lustre is a parallel filesystem, and the benchmarks I looked into recently showed that it does outperform HDFS. It's definitely a production-ready alternative that offers a single namespace for your data (no more HDFS ingestion).
What works from the Hadoop ecosystem today?
What I've seen in production today is Spark, Hive and HBase. Impala looks to me like it requires certain parts of HDFS, which is why it doesn't work with a POSIX FS, and it's not HCFS-compatible. I did a quick test and I was able to create the database and everything, but I wasn't able to fetch any rows.
Let me know if you need further help.

Related

data backup and recovery in hadoop 2.2.0

I am new to Hadoop and very interested in Hadoop administration, so I installed Hadoop 2.2.0 on Ubuntu 12.04 in pseudo-distributed mode. The installation succeeded and I have run some of the example jar files. Now I am trying to learn about data backup and recovery. Can anyone tell me how to back up and recover data in Hadoop 2.2.0? Please also suggest any good books on Hadoop administration and steps for learning it.
Thanks in advance.
There is no classic backup and recovery functionality in Hadoop. There are several reasons for this:
HDFS uses block level replication for data protection via redundancy.
HDFS scales out massively in size, and it is becoming more economical to back up to disk rather than tape.
The size of "Big Data" doesn't lend itself to being easily backed up.
Instead of backups, Hadoop uses data replication. Internally, it creates multiple copies of each block of data (by default, 3 copies). It also has a function called 'distcp', which allows you to replicate copies of data between clusters. This is what's typically done for "backups" by most Hadoop operators.
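As a rough illustration of the distcp approach described above, the following Python sketch assembles a `hadoop distcp` command line for copying one directory between two clusters. The NameNode hostnames, port, and paths are placeholders, and the helper name `build_distcp_command` is made up for this example. It only builds the argument list and does not contact any cluster.

```python
from typing import List


def build_distcp_command(source_uri: str, target_uri: str, update: bool = True) -> List[str]:
    """Build the argument list for a `hadoop distcp` copy between clusters.

    The URIs are supplied by the caller; nothing here contacts a cluster.
    """
    command = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ at the target,
        # so repeated runs against the same directory behave incrementally.
        command.append("-update")
    command.extend([source_uri, target_uri])
    return command


# Placeholder cluster addresses (assumptions, not real hosts).
cmd = build_distcp_command(
    "hdfs://prod-nn.example.com:8020/data/logs",
    "hdfs://backup-nn.example.com:8020/data/logs",
)
print(" ".join(cmd))
```

On a machine with a configured Hadoop client, the resulting list could be passed to `subprocess.run(cmd, check=True)`, typically from a scheduled job.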
Some companies, like Cloudera, are incorporating the distcp tool into creating a 'backup' or 'replication' service for their distribution of Hadoop. It operates against a specific directory in HDFS, and replicates it to another cluster.
If you really wanted to create a backup service for Hadoop, you could build one yourself. You would need some mechanism for accessing the data (NFS gateway, webFS, etc.) and could then use tape libraries, VTLs, etc. to create backups.

Is it possible to combine mapR to pure apache hadoop?

I'm a newbie to Hadoop.
I heard that MapR is a better way to mount Hadoop HDFS than FUSE.
But most of the related articles only describe MapR's Hadoop, not pure Apache Hadoop.
Does anyone have experience mounting pure Apache Hadoop with MapR?
Thanks in advance.
MapR is much more than just a way to mount HDFS.
MapR includes Hadoop and many Apache eco-system components and many other non-Apache components such as Cascading. It also includes LucidWorks which includes Solr.
MapR also includes a reimplementation of HDFS called MaprFS. MaprFS has higher performance, has read-write semantics, allows read during write, supports transactionally correct mirrors and snapshots, has no name node, scales without the futzing of federation, is inherently HA without all the mess of the HA NameNode, and is accessible via a distributed NFS system.
Oh, MaprFS also supports the HBase API in addition to POSIX-ish access via NFS and in addition to the HDFS API.
The map-reduce layer in MapR has been partially re-written to make use of the extremely high performance capabilities of the file system. This is how MapR was able to break the minute sort record last fall.
So naming aside, MapR includes all the open source software that you would get with any other distribution and much more besides. "Pure Hadoop" is next to useless. You need Pig and/or Hive. You probably should look into Cascading/Scalding. You may need Mahout. You definitely will need to connect your system to legacy data sources and reporting systems which is what NFS makes easy.
Keep in mind that mounting HDFS via NFS or FUSE doesn't get you where you want to be. HDFS just doesn't have suitable semantics for access via NFS or normal file system APIs. It just has too many compromises.
With MapR, on the other hand, you can even run databases like MySQL or Postgres on top of the cluster's file system via NFS.
MapR comes in three editions.
M3 is free and gives you all the performance and scalability, but limits you to a single NFS server and no mirrors, snapshots, volume locality or HBase compatible API (you can run HBase itself, of course). HA is also degraded in M3 so that it takes an hour to fail over certain functions.
M5 costs money after the free trial period and gives you snapshots, mirrors, the ability to force some data to different topologies and unlimited NFS servers.
M7 also costs money and adds the HBase API to all that M5 can do.
See mapr.com for more info.
To sum up what Ted said as well,
You're not really "mounting pure Apache Hadoop with MapR". Hadoop shouldn't be confused with HDFS. While the terms tend to be used interchangeably in conversation, HDFS refers specifically to the distributed filesystem (hence the DFS in HDFS). HDFS has to be accessed through specific Hadoop commands; for example, "hadoop fs -ls /" lists the root contents of HDFS.
MapR went above and beyond what Hadoop provides you by default. First, you can interact with the filesystem using the more efficient MaprFS (a rewrite of HDFS). Second, you can NFS-mount the HDFS/MaprFS filesystem so that you can manipulate it natively without having to do anything special. It gets treated like any other NFS filesystem, except that in this case it's distributed across your cluster.
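To illustrate what "natively" means here, the sketch below uses only ordinary Python file operations, with no Hadoop client library involved. The mount point is an assumption: on a real cluster it would be wherever the NFS export is mounted, while here a temporary directory stands in for it so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def write_and_list(mount_point: Path) -> list:
    """Create a file under the mount point and return the directory listing.

    Only standard file APIs are used; on an NFS-mounted cluster filesystem
    the same calls would operate on distributed storage.
    """
    report = mount_point / "report.txt"
    report.write_text("hello from a plain file API\n")
    # Appending works the same way as on any local filesystem.
    with report.open("a") as handle:
        handle.write("second line\n")
    return sorted(p.name for p in mount_point.iterdir())


# A temporary directory stands in for the hypothetical NFS mount point.
with tempfile.TemporaryDirectory() as tmp:
    names = write_and_list(Path(tmp))
    content = (Path(tmp) / "report.txt").read_text()

print(names)
print(content, end="")
```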

Can someone explain the Hadoop stack to me?

I am looking to understand and probably play with Hadoop, and am looking at the open source projects from Facebook here. There seem to be way too many to wrap my head around. If someone can explain where and how each of these projects fits, that would be a great help.
As some background, I am thinking about working on a project where the primary driver is images, so I want to start things off right when picking a platform (solution). Please feel free to suggest any other technologies as well.
Cloudera has a table that gives equivalents of core Hadoop projects in terms of the Google stack:
Google | Hadoop
MapReduce | MapReduce
GFS | HDFS
BigTable | HBase
Chubby | ZooKeeper
Sawzall | Hive, Pig
These, and particularly the first four, are the core components others build on. MapReduce spawns workers as close as possible to the data they will work on. HDFS stores and replicates unstructured data. HBase is a column-oriented store. ZooKeeper handles service discovery, locking, and leader election. Hive and Pig are high-level query languages whose queries are compiled into MapReduce computations over data in HDFS or HBase.
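For readers new to the MapReduce model mentioned above, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) using a word count. It is not Hadoop code and says nothing about how Hadoop schedules work; the function names are made up for illustration. In a real cluster the map and reduce steps run in parallel on many nodes, close to the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```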
There is a lot more to the project ecosystem, from self-contained tools like Avro (serialisation, think protocol buffers), toolkits like Mahout (machine learning), to full-featured products like Nutch (crawler and search engine from which Hadoop was spun off).
Integrators are making distributions of Hadoop and Hadoop-like stacks (Hadoop is loosely coupled and some provide alternatives to important components); the core projects are maintained by the Apache foundation.
I wrote an article on this very topic last month:
The Hadoop Universe
I think it explains all the Hadoop-related Apache projects reasonably, in a paragraph each.
The Hadoop ecosystem is growing at a very fast pace. There are open source (like Cloudera) and commercial (like MapR) distributions. Start with the Hadoop ecosystem world map and go to the next level as required. The article is a bit outdated, but still relevant.

Variants of Hadoop

A project of mine is to compare different variants of Hadoop. It is said that there are many of them out there, but googling didn't work well for me :(
Does anyone know of any variants of Hadoop? The only one I found was HaLoop.
I think the more generic term is "map reduce":
http://www.google.com/search?gcx=c&sourceid=chrome&ie=UTF-8&q=map+reduce&safe=active
Not exactly sure what you mean by different variants for Hadoop.
But, there are a lot of companies providing commercial support or providing their own versions of Hadoop (open-source and proprietary). You can find more details here.
For example, MapR has its own proprietary implementation of Hadoop, but they claim it's compatible with Apache Hadoop, which is a bit vague because Apache Hadoop is evolving and there are no standards around the Hadoop API. Cloudera has its own version of Hadoop, CDH, which is based on Apache Hadoop. Hortonworks was spun out of Yahoo and provides commercial support for Hadoop.
You can find more information here. Hadoop is evolving very fast, so this might be a bit stale.
This can refer to
- Hadoop's file system,
- its support for MapReduce,
- or even more generally, the idea of cloud / distributed storage systems.
It's best to clarify which aspects of Hadoop you are interested in.
Of course, when comparing Hadoop academically, you should start by looking at GFS, since that is the origin of Hadoop.
Setting HBase aside, we can see Hadoop as two layers: a storage layer and a MapReduce layer.
The storage layer has several quite different implementations that would be interesting to compare: the standard Hadoop file system (HDFS), HDFS over Cassandra (Brisk), HDFS over S3, and MapR's Hadoop implementation.
MapR has also changed the MapReduce implementation.
This site http://www.nosql-database.org/ has a list of a lot of NoSql DBs out there. Maybe it can help you.

What is the best components stack for building distributed log aggregator (like Splunk)?

I'm trying to find the best components I could use to build something similar to Splunk in order to aggregate logs from a large number of servers in a computing grid. It should also be distributed, because I have gigabytes of logs every day and no single machine will be able to store them.
I'm particularly interested in something that will work with Ruby and will work on Windows and latest Solaris (yeah, I got a zoo).
I see architecture as:
Log crawler (Ruby script).
Distributed log storage.
Distributed search engine.
Lightweight front end.
The log crawler and distributed search engine are already decided: logs will be parsed by a Ruby script and ElasticSearch will be used to index log messages. The front end is also very easy to choose: Sinatra.
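A rough sketch of the crawler-to-index step follows, written in Python rather than the Ruby the poster plans to use. It parses log lines into documents and serializes them in the newline-delimited JSON layout that the Elasticsearch bulk API accepts (an action line followed by the document). The log line format, the field names, and the index name `logs` are all assumptions for illustration, and older Elasticsearch versions may expect additional fields in the action line.

```python
import json
import re
from typing import List, Optional

# Assumed line format: "YYYY-MM-DD HH:MM:SS host LEVEL message"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<level>[A-Z]+) (?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Turn one log line into a document, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def to_bulk_body(lines: List[str], index: str = "logs") -> str:
    """Serialize parsed lines as action/document pairs, one JSON object per line."""
    parts = []
    for line in lines:
        doc = parse_line(line)
        if doc is None:
            continue  # skip lines that do not match the assumed format
        parts.append(json.dumps({"index": {"_index": index}}))
        parts.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(parts) + "\n" if parts else ""


sample = [
    "2012-01-15 10:32:01 grid-node-07 ERROR Disk full on /var",
    "not a log line",
]
body = to_bulk_body(sample)
print(body, end="")
```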
My main problem is distributed log storage. I looked at MongoDB, CouchDB, HDFS, Cassandra and HBase.
MongoDB was rejected because it doesn't work on Solaris.
CouchDB doesn't support sharding (smartproxy is required to make it work but this is something I don't want to even try).
Cassandra works great, but it's a disk space hog and it requires running autobalance every day to spread the load between Cassandra nodes.
HDFS looked promising but FileSystem API is Java only and JRuby was a pain.
HBase looked like the best solution around, but deploying and monitoring it is just a disaster: in order to start HBase I need to start HDFS first and check that it started without problems, then start HBase and check it too, and then start the REST service and check that as well.
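The start-then-check sequence described for HBase can be scripted. The sketch below shows one way to express it, assuming each service is described by a name, a start action, and a health check. The callables here are stand-ins that only record what happened; in practice they would invoke the real start scripts and probe the running daemons.

```python
from typing import Callable, List, Tuple

# (name, start action, health check)
Service = Tuple[str, Callable[[], None], Callable[[], bool]]


def start_in_order(services: List[Service]) -> List[str]:
    """Start each service in turn, stopping at the first failed health check."""
    started = []
    for name, start, is_healthy in services:
        start()
        if not is_healthy():
            raise RuntimeError(
                f"{name} failed its health check; started so far: {started}"
            )
        started.append(name)
    return started


# Stand-in actions (assumptions): they only append to a log list.
log = []
services = [
    ("hdfs", lambda: log.append("start hdfs"), lambda: True),
    ("hbase", lambda: log.append("start hbase"), lambda: True),
    ("hbase-rest", lambda: log.append("start rest"), lambda: True),
]
started = start_in_order(services)
print(started)
```

Because the function raises on the first failed check, a dependent service such as HBase is never started on top of a broken HDFS.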
So I'm stuck. Something tells me HDFS or HBase are the best thing to use as a log storage, but HDFS only works smoothly with Java and HBase is just a deploying/monitoring nightmare.
Can anyone share their thoughts or experience building similar systems using the components I described above, or with something completely different?
I'd recommend using Flume to aggregate your data into HBase. You could also use the Elastic Search Sink for Flume to keep a search index up to date in real time.
For more, see my answer to a similar question on Quora.
With regard to Java and HDFS: using a tool like BeanShell, you can interact with the HDFS store from scripts. Note that BeanShell scripts use Java-like syntax rather than JavaScript.
