Which version of Hadoop should I use?

Hadoop currently has three branches: 0.2x, 1.x, and 2.x. What are the arguments in favor of using one over another?

Hadoop recently changed its Map/Reduce implementation (now called YARN). That may be one reason to go for a relatively new version.
If you want to use Hadoop in conjunction with other, related projects like HBase, matching compatible versions is not entirely trivial.
You may want to look at Cloudera's offering (I am not affiliated with Cloudera). They offer distributions from which you can pick a subset of tools that are known to work together. And of course they also offer professional services.

One way to deal with the overwhelming number of Hadoop versions available out there is to go with one of the Cloudera offerings. Products like these make things easier because you don't have to worry too much about configuration and compatibility.

Related

Install Hadoop, Pig and Hive on a laptop

I want to install Hadoop, Pig and Hive on my laptop. I don't know how to install and configure them, or what software is required to do it.
Please let me know the exact steps required to install/configure Hadoop, Pig and Hive on a laptop.
Also, can I use the Windows OS and install Hadoop on Windows?
For beginners, I would recommend sticking to a good prepackaged Hadoop distribution/sandbox. Even if you want to learn how to set up a Hadoop cluster before using the tools it provides (e.g. Hive etc.), setting up a common distribution is a lot easier, at least in the beginning.
Prepackaged sandboxes for Hadoop are going to be Linux-based. But most likely you will not need to do a lot in Linux to start using Hadoop if you start from these sandboxes. Personally, I think the time you will save by avoiding support and documentation issues with Windows ports will more than compensate for any added effort required for jumping into Linux, and you will at least enter the domain of Linux, which is itself a tremendously important tool.
For prepackaged solutions, you may aim at the Cloudera quickstart VM or the MapR quickstart VM, as these are the most widely used distributions. By using sandboxes, you will skip the installation process (which may be hectic if you don't know what you want, and especially if you aren't familiar with Linux) and jump right into the usage of the tools. Thanks to the good documentation available from large vendors such as Cloudera and MapR, you will also face fewer issues in accessing the tools you want to learn.
Follow the vendor specific setup guidelines (also listed on the download pages as getting started guides) for further details on setting up the sandbox.
Once you have the sandbox set up, you can use a lot of different ways to access Hive and Pig. You can use a command line interface for Hive (called Beeline). If you are familiar with JDBC, you can access Hive through that (a minimal example is sketched below). Installing Apache Thrift enables much wider access options, but you can also save that for later.
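For illustration, a quick JDBC connectivity check against HiveServer2 might look like the sketch below. The host, port, database and credentials are assumptions that depend on your particular sandbox (10000 is the default HiveServer2 port on most distributions), and the driver class ships with the Hive client libraries.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (from the Hive client libraries).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Connection details are placeholders; adjust them to your sandbox.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // List the tables in the default database as a quick sanity check.
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```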
I would not recommend learning Pig unless you have very specific uses for it. If you are familiar with Java (or Scala, or even Python, among other options), try writing some MapReduce-style jobs to learn more about how Hadoop works; a classic word-count sketch follows below. Open the Ambari (or Cloudera Manager etc.) interface which comes pre-configured with these sandboxes and see the tools and services that come pre-packaged with the sandbox. These are the most common ones and can be used as a useful list for starters. Start learning about them (but skip Pig if you can, even if it is pre-installed ;)
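As a first exercise, the canonical word count written against the standard org.apache.hadoop.mapreduce API is a reasonable place to start. The input and output paths are supplied on the command line and are assumed to live on the sandbox's distributed filesystem.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```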
Once you are familiar with the sandbox you have, I would suggest going for Apache NiFi, which has an easier learning curve and gives a lot of flexibility. But you will most likely have to set up a new sandbox for that. It may also serve as a good revision exercise. Integrate it with your Hadoop sandbox, implement some decent use cases, and you will have some good experience to show.

What is the meaning of "Hadoop distribution"?

I am new to Hadoop. I recently read about the basics of Apache Hadoop, Pig, Hive and HBase.
Then I came across the term "Hadoop distribution", and the examples were Cloudera, MapR and Hortonworks.
So what is the relation of Apache Hadoop (and its ecosystem) to a "Hadoop distribution"?
Is it like the Java Virtual Machine specification (a document) and the Oracle JVM or IBM JVM (working implementations of the document)?
But we get zips from Apache which already contain a working implementation.
So I am a bit confused.
Since Hadoop is an open source project, a number of vendors have developed their own distributions, adding new functionality or improving the code base.
Vendor distributions are, of course, designed to overcome issues with the open source edition and provide additional value to customers, with a focus on things such as:
Reliability: The vendors react faster when bugs are detected. They promptly deliver fixes and patches, which makes their solutions more stable.
Support: A variety of companies provide technical assistance, which makes it possible to adopt the platforms for mission-critical and enterprise-grade tasks.
Completeness: Very often Hadoop distributions are supplemented with other tools to address specific tasks.
Have a look at this top-hadoop-distributions article and this presentation for a benchmarking analysis of the top three Hadoop distributions.
Based on Distributions and Commercial Support, the following companies provide products that include Apache Hadoop, a derivative work thereof, commercial support, and/or tools and utilities related to Hadoop.
Some companies release or sell products that include the official Apache Hadoop release files and/or their own and other useful tools. Other companies or organizations release products that include artifacts built from modified or extended versions of the Apache Hadoop source tree. Such derivative works are not supported by the Apache team: all support issues must be directed to the suppliers themselves.

Which one is the official command line package for Pacemaker, crmsh or pcs?

I am working on a Linux-HA cluster with pacemaker-1.1.10-1.el6_4.4. As you know, in this Pacemaker version the cluster command line functionality is not packaged with the pacemaker package itself, and I found two packages: crmsh and pcs. My question is: which one is the official command line interface? Which one is recommended? And what is the relation between them?
thanks,
Emre
There is no One-True-CLI for Pacemaker.
The best suggestion is to use whatever your distribution provides support for (pcs on RHEL and its clones, crmsh for SLES).
The biggest difference is that pcs can configure the entire cluster (including corosync), not just the pacemaker portion. It also doesn't try to have a 1-1 mapping between the underlying XML constructs and its command-line, which provides a certain degree of freedom to simplify things.
While there is no official relationship between the two projects, they continue to share ideas for improvements in a usability arms race :-)

Which distribution of Hadoop is better?

I am working with massive data; my input data is about 100 GB. I want to choose one of the Hadoop distributions, but I don't know whether to choose a MapR cluster or a Cloudera cluster. I want to use the free versions (MapR M3 and Cloudera CDH4, which is based on Hadoop 2.x).
Which of them is better? Which configurations should I use so that they work best?
Thanks.
Actually speaking, the answer to this question is the most common answer in this world: it depends. It's totally up to you and your requirements. One might find one particular flavor more suitable for his/her needs, while you might find the same flavor less useful. Moreover, it's all about personal choice; I personally like Apache's Hadoop. All are good. It's just a matter of which one fits your needs.
"Which of them is better?" is a controversial topic. Questions like this often end up as heated arguments. See this question for example. So I'm not going to list the advantages of any one over the other. But there are certain differences among these flavors of Hadoop which could probably help you during your thought process.
The major difference between CDH (and Apache Hadoop as well) and MapR is that MapR uses its own proprietary file system, MapRFS, instead of HDFS. The M3 Edition is free and available for unlimited production use. Support is provided on a community basis and through MapR's forums. CDH is 100% open source and you can use the "Standard" version of Cloudera Manager without any charges. And Apache, well, it's Apache :). Do whatever you feel like.
MapR has even partnered recently with Canonical, the organization behind the Ubuntu operating system, in an effort to make Hadoop available as an integrated part of Ubuntu through its repositories. The partnership announced that MapR's M3 Edition for Apache Hadoop will be packaged and made available for download as an integrated part of the Ubuntu operating system (see this if you need more info). The source code is available on GitHub. The CDH codebase is the same as Apache's, with some patches of their own.
But the free edition lacks some good features like JobTracker HA, NameNode HA, mirroring, snapshots etc. CDH4, being based on Hadoop 2.x, does provide the HA features though. By virtue of its design, MapR doesn't have any SPOF the way CDH3 (or Hadoop 1.x) does. MapRFS stores data in volumes, conceptually a set of containers distributed across a cluster. Each container includes its own metadata, eliminating the central NameNode single point of failure. Still, the API is Apache Hadoop compatible (see the sketch below). MapR setup requirements differ from Apache/CDH; for example, MapR requires raw volumes to be available for installation. Once you have the correct hardware and OS prerequisites, setup and eval times should be on the same order of magnitude as Apache/CDH.
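Because the API stays Apache Hadoop compatible, client code written against org.apache.hadoop.fs.FileSystem runs unchanged on HDFS or MapRFS; only the configured default filesystem differs. A minimal sketch, with the configuration values and path purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListClusterFiles {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS normally comes from core-site.xml on the client's classpath;
        // typical values are "hdfs://namenode:8020" for Apache/CDH and "maprfs:///" for MapR.
        Configuration conf = new Configuration();

        // The same code works against either filesystem implementation.
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/user"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
    }
}
```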
IMHO, M3 is not gonna give you huge advantages over Apache/CDH, as some of the catchy MapR features, like NFS HA, snapshots etc., are not present in the M3 free edition.
Being the first one, Cloudera definitely has an extra edge in terms of experience and a solid customer base. But MapR has been more innovative in terms of significant changes to the MapReduce and HDFS components to improve performance.
I'll write some more after some time, as I'm on a call and you are waiting for the answer ;)

Starfish or Splunk?

Hi all,
My goal is to analyze Hadoop log files, and there are two tools: Starfish (open source) and Splunk (a commercial product). Does anyone know the pros and cons and which one to choose?
I really appreciate your answer.
Thanks
Well, the pros and cons are the same as for any open source vs. commercial tool choice. The main guideline should be: what are your prerequisites?
Splunk itself is not open source, but the free license allows you to index 500 MB/day. Probably its main advantage is providing a BI tool cheaper than other commercial ones. It also has an impressive number of plugins, including for Hadoop, and, like Hadoop, it has relied on a (different) MapReduce implementation since Splunk 4.x. It has both a Python and a Java SDK, which may come in handy. Its approach is: install it and, after a (minimal) setup, start playing with your data.
I don't know Starfish, though it does look promising; it only seems to require JavaFX, while Splunk ships with its own bundled Python.
But in the end, it all boils down to what your most important prerequisites are.
Barriers to entry are low for both. Best is to try both out for a while and see what works for you.
Depending on your use case each tool has different strengths. What is your use case?
Generally speaking Splunk is easy and modern with great community support. Answers are generally a few searches away.

Resources