Hadoop distribution - hadoop

I was using IBM BigInsights through VNC software (remote access) provided by my university, but I can't access the Internet from that desktop. To use some of the data samples available online, I decided to install Hadoop on my laptop (single-node cluster), but I found that there are many distributions. What is the best free Hadoop distribution for a beginner to train on?
1) Amazon Elastic MapReduce
2) Cloudera CDH Hadoop Distribution
3) Hortonworks Data Platform (HDP)
4) MapR Hadoop Distribution
5) IBM Open Platform
6) Microsoft Azure's HDInsight - Cloud-based Hadoop Distribution
7) Pivotal Big Data Suite
8) Datameer Professional
9) Datastax Enterprise Analytics
10) Dell - Cloudera Apache Hadoop Solution

CDH and Hortonworks are the easiest for getting a single-node cluster up and running, and both are very widely used, so you can find a lot of troubleshooting resources.
If you just want to write application code and run arbitrary MapReduce jobs rather than learn the Hadoop systems architecture, then Amazon EMR is more suitable.

Related

Set up the MarkLogic Connector for Hadoop on a Windows machine?

I need to set up the MarkLogic Connector for Hadoop to send the ml files to HDFS storage and retrieve them.
I went through one of the MarkLogic documents, and its required software section says that it requires Linux:
The MarkLogic Connector for Hadoop is a Java-only API and is only available on Linux. You can use the connector with any of the Hadoop distributions listed below. Though the Hadoop MapReduce Connector is only supported on the Hadoop distributions listed below, it may work with other distributions, such as an equivalent version of Apache Hadoop.
So, does this mean I can't achieve this on a Windows machine?

Difference Between typical Hadoop Architecture and MapR architecture

I know that Hadoop is based on a master/slave architecture:
HDFS works with NameNodes and DataNodes,
and MapReduce works with JobTrackers and TaskTrackers.
But I can't find all of these services on MapR; I found out that it has its own architecture with its own services.
I'm a little confused. Could anyone please tell me what the difference is between using Hadoop alone and using it with MapR?
You should refer to the latest Hadoop 2.x architecture, since YARN (Yet Another Resource Negotiator) and High Availability were introduced in the 2.x line.
The JobTracker and TaskTracker are replaced by the ResourceManager, the NodeManager, and a per-application ApplicationMaster.
Hadoop 2.x YARN & High Availability
For MapR architecture, refer to MapR article
For a comparison between the different distributions, refer to this image
Detailed comparison is available at Data-magnum article by Bill Vorhies
MapR and Apache Hadoop do NOT have the same architecture at the storage level. MapR uses its own filesystem, MapR-FS, which is completely different from HDFS in both concept and implementation. You can find a more detailed comparison here: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots#.VfGwwxG6eUk
https://www.mapr.com/resources/videos/comparison-mapr-fs-and-hdfs
MapR uses most of the Apache big data projects as its baseline.
MapR is a provider of a Hadoop (and big data technology stack) distribution, with certain add-ons and technical support for its clients.
Underneath, MapR is built on the same architecture as Apache Hadoop, including all of the core libraries. However, the MapR distribution is more like a bundle: a complete and compatible big data technology package.
The main benefit of MapR is that its distributions of various technologies such as Hive, HBase, and Spark are compatible with core Hadoop and with each other. This is particularly important because the big data technologies evolve at different paces, so new releases become incompatible with one another very quickly.
So vendors like MapR and Cloudera provide their own version of a Hadoop distribution, plus support, so that end users can concentrate on building their product without worrying about compatibility issues. But almost all of them use the Apache distribution under the hood.
In the future, they might come up with certain variations and additional features in an attempt to keep clients from switching to other vendors, but as of now that is not the case.

How to install Apache Spark on HortonWorks HDP 2.2 (built using Ambari)

I successfully built a 5 node cluster of HortonWorks HDP 2.2 using Ambari.
However I don't see Apache Spark in the installed services list.
I did some research and found that Ambari does not install certain components such as Hue (Spark was not in that list, but I guess it's not installed).
How do I do a manual install of Apache spark on my 5 node HDP 2.2?
Or should I delete my cluster and perform a fresh install without using Ambari?
Hortonworks support for Spark is arriving but not fully complete (details and blog).
Instructions for how to integrate Spark with HDP can be found here.
You could build your own Ambari Stack for Spark. I recently did just that, but I cannot share that code :(
What I can do is share a tutorial I wrote on how to build any stack for Ambari, including Spark. There are many interesting issues with Spark that need to be addressed and are not covered in the tutorial. Anyway, hope it helps. http://bit.ly/1HDBgS6
There is also a guide from the Ambari people here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=38571133.
1) Ambari 1.7.x does not install the Accumulo, Hue, Ranger, or Solr services for the HDP 2.2 Stack.
To install the Accumulo, Hue, Knox, Ranger, and Solr services, install HDP manually.
2) Apache Spark 1.2.0 on YARN with HDP 2.2: here.
3) Spark and Hadoop: Working Together:
Standalone deployment: With the standalone deployment one can statically allocate resources on all or a subset of machines in a Hadoop cluster and run Spark side by side with Hadoop MR. The user can then run arbitrary Spark jobs on her HDFS data. Its simplicity makes this the deployment of choice for many Hadoop 1.x users.
Hadoop YARN deployment: Hadoop users who have already deployed or are planning to deploy Hadoop YARN can simply run Spark on YARN without any pre-installation or administrative access required. This allows users to easily integrate Spark in their Hadoop stack and take advantage of the full power of Spark, as well as of other components running on top of Spark.
Spark In MapReduce : For the Hadoop users that are not running YARN yet, another option, in addition to the standalone deployment, is to use SIMR to launch Spark jobs inside MapReduce. With SIMR, users can start experimenting with Spark and use its shell within a couple of minutes after downloading it! This tremendously lowers the barrier of deployment, and lets virtually everyone play with Spark.
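As a rough illustration of how the standalone and YARN deployments differ from the submitting user's side, here is a minimal Python sketch that only builds the `spark-submit` command line for each mode. The `--master`, `--deploy-mode`, and `--class` options and the `spark://HOST:7077` master URL are real `spark-submit` conventions; the helper name `spark_submit_args`, the host `master-node`, the jar `wordcount.jar`, and the class `com.example.WordCount` are hypothetical placeholders. The YARN form shown assumes a recent Spark release; the Spark 1.x releases of the HDP 2.2 era used `--master yarn-cluster` instead. SIMR is a separate tool and is not covered here.

```python
def spark_submit_args(mode, app_jar, main_class, master_host=None):
    """Build a spark-submit argument list for a deployment mode.

    This helper is hypothetical; it does not run anything, it only
    assembles the command so the two modes can be compared.
    """
    if mode == "standalone":
        if master_host is None:
            raise ValueError("standalone mode needs the Spark master host")
        # 7077 is the default port of a standalone Spark master.
        master = ["--master", f"spark://{master_host}:7077"]
    elif mode == "yarn":
        # On YARN, Spark finds the cluster through the Hadoop client
        # configuration (HADOOP_CONF_DIR or YARN_CONF_DIR), so no host is given.
        master = ["--master", "yarn", "--deploy-mode", "cluster"]
    else:
        raise ValueError(f"unknown deployment mode: {mode}")
    return ["spark-submit", *master, "--class", main_class, app_jar]


if __name__ == "__main__":
    # Placeholder jar, class, and host names, for illustration only.
    for mode, host in (("standalone", "master-node"), ("yarn", None)):
        args = spark_submit_args(mode, "wordcount.jar", "com.example.WordCount", host)
        print(" ".join(args))
```

The resulting list could be passed to `subprocess.run` on a machine where `spark-submit` is on the PATH and the cluster configuration is in place.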

What is Hue all about?

I am new to Big Data and want to know about Hue. All I know is that it is a web interface for managing the Hadoop ecosystem. Please let me know if I can install it on my PC (Ubuntu Precise). I am running Apache Hadoop 1.2.1 in pseudo-distributed mode with Pig and Hive.
Thanks in advance
Hue is a web interface for analyzing data with Apache Hadoop. You can install it on any PC with any Hadoop version.
Hue is a suite of applications that provide web-based access to CDH components and a platform for building custom applications.
The following figure illustrates how Hue works. Hue Server is a "container" web application that sits in between your CDH installation and the browser. It hosts the Hue applications and communicates with various servers that interface with CDH components.
Here you can find the full documentation for Hue, along with downloads:
http://gethue.com/

What are some sites for Hadoop best practices?

What are some sites for Hadoop best practices (not books) where I can get a step-by-step process for creating new projects, along with small examples? I am not able to find a single site like this; please share.
There is an awesome article from Yahoo developers on Apache Hadoop: Best Practices and Anti-Patterns
Hadoop is not a single application; it is a distributed processing framework used by several applications that sit on top of it. Pig, Hive, HBase, Cassandra, etc. are a few of the many such applications, each designed for a specific requirement. Underneath, all of these applications use the Hadoop framework, which mainly consists of a distributed file system (HDFS) and distributed processing (MapReduce).
Technically, once you have a bare-minimum Hadoop cluster (HDFS + MapReduce only), you can start writing MapReduce-based applications (in Java, or in other languages supported through Hadoop Streaming) to process some data.
What you could do is first download a pre-built, pre-configured Hadoop virtual machine image from the Cloudera or Hortonworks distribution and get it running on your machine. After that, start learning to write MapReduce jobs in Java and run them in your virtual machine.
Here is the URL to download Cloudera Hadoop Distribution VM
Here is the link to learn writing simplest wordcount job.
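To give a feel for what a word count job does before setting up a cluster, here is a minimal sketch in Python that mimics the map, shuffle, and reduce phases locally. It is not the linked tutorial's code and it does not use Hadoop at all; the function names `mapper`, `reducer`, and `word_count` are assumptions made for illustration. With Hadoop Streaming, the mapper and reducer would instead be separate scripts reading from standard input and writing tab-separated pairs to standard output.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts of each word.

    Expects pairs sorted by word, which is what the shuffle step
    guarantees to a reducer.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, a local stand-in for the shuffle (a sort), then reduce."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    sample = ["Hadoop runs MapReduce jobs", "MapReduce jobs read from HDFS"]
    for word, count in sorted(word_count(sample).items()):
        print(f"{word}\t{count}")
```

The sort between the two phases is the important part: grouping by key is what lets each reducer see all the counts for one word together.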

Resources