I have spent the past 8 hours trying to set up my Hadoop cluster, and to be honest, it's getting exhausting. It's not just today; it has been a few weeks, to be exact. I have tried probably 20-30 different tutorials I found on the web, and each time I get errors towards the end, like SSH connection issues, JVM failures, PATH issues, or the worst of all (WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable), and many more. All of that leads me to reset my VMware Player and reinstall Ubuntu over and over again.
I am working towards the Hadoop Developer certification, and I need hands-on experience with everything from MapReduce and Hadoop to the ecosystem (Hive, Pig, etc.). The only thing in my way right now is setting up the cluster for practice. I have run out of options.
My question is: is there any way (the easier the better, but any will do) to install Hadoop MapReduce version 2 (YARN) without pulling my hair out? I would really like something that has been shown to be consistent and has worked for multiple people.
I am on 64-bit Ubuntu.
EDIT: Thanks to everyone in advance
You haven't stated if you were using vanilla Hadoop vs a distribution. If you're using the vanilla Apache Hadoop version, you may want to try a distribution like CDH.
The CDH5B2 Documentation specifically addresses how to perform the installs in Ubuntu. The distribution contains YARN, Spark, Hive, Pig, Sqoop, Flume, etc., so it should fit all of your needs.
Thank you for that. You set me down the right path.
For those interested in development, go here
http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo
It will save you a lot of pain.
Hadoop Video description
http://www.youtube.com/watch?v=o19zLaTuuSk
Related
I have watched hundreds of videos and read hundreds of articles, but they are all so complicated. Why can't people give a clear introduction to what something is before breaking it down? What the hell is Hadoop? I get that it is some kind of distributed file system with cool features like high performance, HDFS, YARN, MapReduce, Hadoop Common, blah blah. Please, someone tell me what it is. Is it software like Visual Studio, Anaconda Navigator, or Android Studio? Or is it a huge company with thousands of data servers where you can upload your company's data and manage it there? Why do these YouTube videos say that Hadoop is storage efficient? Does that mean you use Hadoop's data servers and they store your data efficiently? I am absolutely sure that I am not the only one who asks these questions while watching these videos.
Thanks in advance!
Hadoop is an ecosystem, meaning it consists of several pieces of software that work together in distributed mode.
The three main pieces of software inside Hadoop, each provided as a service, are HDFS, YARN, and MapReduce.
HDFS is a distributed file system. It is a file system much like the ones on our PCs (NTFS, FAT32, EXT4, etc.). The main goal of a file system is to manage files. Each file in a file system consists of blocks (think of a file as divided into chunks). A local file system keeps these blocks on one machine, but HDFS splits and distributes them across several machines.
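To make the block idea concrete, here is a minimal Python sketch of cutting a file into blocks and spreading them over machines. It is only a toy: the function names, the tiny block size, the node names, and the round-robin placement are all assumptions made for illustration. Real HDFS uses much larger blocks, replicates each block, and has its own placement policy.

```python
from itertools import cycle


def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: list, machines: list) -> dict:
    """Assign block indexes to machines in round-robin order (toy placement only)."""
    placement = {machine: [] for machine in machines}
    for index, machine in zip(range(len(blocks)), cycle(machines)):
        placement[machine].append(index)
    return placement


file_bytes = b"abcdefghij"  # a 10-byte stand-in for a file
blocks = split_into_blocks(file_bytes, block_size=4)
placement = place_blocks(blocks, ["node1", "node2", "node3"])  # hypothetical node names

print(blocks)
print(placement)
```

Joining the blocks back together in index order gives the original file, which is all a reader of the file needs to do once it knows which machine holds which block.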
YARN is a resource manager like the one inside our OS, but it manages the resources of several machines. When you run an application (e.g. Notepad) on your PC, your OS gives the application CPU cores (when needed) and memory. In YARN, when you submit an application, it gives the application CPU cores (when needed) and memory across all the machines, and the application must know how to work in distributed mode.
MapReduce helps you write programs that run in distributed mode without needing to know how to write a distributed application yourself. MapReduce is something like a programming language plus a compiler, producing programs that YARN knows how to run.
One day Google decided to introduce a distributed file system,
and named it the Google File System, or GFS.
The idea was published in a 2003 paper: [https://static.googleusercontent.com/media/research.google.com/en/ir/archive/gfs-sosp2003.pdf]
At the same time, Doug Cutting, the creator of Apache Lucene, was working on Apache Nutch, and this paper helped the Nutch team implement a nice distributed file system named NDFS, the Nutch Distributed File System.
Then Nutch was able to split files into multiple blocks, store them on multiple servers, and replicate them to preserve data reliability.
But this data needs to be processed, and you cannot simply fetch terabytes of data onto a single machine and process it there.
In 2004, Google again came up with a great idea, named MapReduce.
[http://research.google.com/archive/mapreduce.html]
With MapReduce, you can put your data processing code near your data blocks, process the blocks locally (map phase), then gather the results and sum them up (reduce phase).
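The map and reduce phases can be sketched in a few lines of plain Python. This is a single-process toy word count, not Hadoop's API: the function names and the two sample "blocks" are made up for illustration, and the grouping step in the middle stands in for what a real framework does between the two phases.

```python
from collections import defaultdict


def map_phase(block: str) -> list:
    """Emit a (word, 1) pair for every word found in one data block."""
    return [(word, 1) for word in block.split()]


def shuffle(pairs: list) -> dict:
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: dict) -> dict:
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a data block stored on a different machine.
blocks = ["big data big", "data cluster"]

# The map step runs once per block, which is what lets it run next to the data.
mapped = [pair for block in blocks for pair in map_phase(block)]
counts = reduce_phase(shuffle(mapped))

print(counts)
```

In a real cluster each call to map_phase would run on the machine that holds the block, and only the small (word, count) pairs would travel over the network.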
Now Nutch had both NDFS and MapReduce.
Quoting from the book Hadoop: The Definitive Guide by Tom White:
NDFS and the MapReduce
implementation in Nutch were applicable beyond the realm of search,
and in February 2006 they moved out of Nutch to form an independent
subproject of Lucene called Hadoop.
That is the story of Hadoop.
But now other tools have come along to work with Hadoop,
like Apache Hive, which helps us run SQL-like queries on big, distributed data in the Hadoop file system (now named HDFS),
or Apache Pig, which uses a high-level language to run complex analyses on files stored in Hadoop using the MapReduce framework.
There are many other tools, like HBase, Spark, Flume, Sqoop, Crunch, etc.
We don't call them Hadoop, but they live in a Hadoop-centered ecosystem,
so we call this community the Hadoop ecosystem.
It's better to start by reading the book Hadoop: The Definitive Guide by Tom White
instead of reading multiple blog posts and watching YouTube videos.
I want to install Hadoop, Pig and Hive on my laptop. I don't know how to install and configure Hadoop, Pig and Hive, or what software is required to do it.
Please let me know the exact steps required to install and configure Hadoop, Pig and Hive on a laptop.
Also, can I use Windows and install Hadoop on Windows?
For beginners, I would recommend sticking to a good prepackaged Hadoop distribution/sandbox. Even if you want to learn how to set up a Hadoop cluster before using the tools it provides (e.g. Hive), setting up a common distribution is a lot easier, at least in the beginning.
Prepackaged sandboxes for Hadoop are going to be in Linux. But most likely, you will not need to do a lot in Linux to start using Hadoop if you start from these sandboxes. Personally, I think the time you will save by avoiding support and documentation issues on Windows ports will compensate greatly for any added effort required for jumping into Linux, and you will at least enter the domain of Linux which itself is a tremendously important tool.
For prepackaged solutions, you may try the Cloudera QuickStart VM or the MapR quickstart VM, as these are the most widely used distributions. By using sandboxes, you will skip the installation process (which may be hectic if you don't know what you want, and especially if you aren't familiar with Linux) and jump right into using the tools. Because large vendors such as Cloudera and MapR provide good documentation, you will also face fewer issues in accessing the tools you want to learn.
Follow the vendor specific setup guidelines (also listed on the download pages as getting started guides) for further details on setting up the sandbox.
Once you have the sandbox set up, there are a lot of different ways to access Hive and Pig. You can use a command line interface for Hive (called beeline). If you are familiar with JDBC, you can access Hive through that. Install Apache Thrift to enable much wider access options, but you can also save that for later.
I would not recommend learning Pig unless you have very specific uses for it. If you are familiar with Java (or Scala, or even Python, among other options), try writing some MapReduce-style jobs to learn more about how Hadoop works. Open the Ambari (or Cloudera Manager, etc.) interface, which comes pre-configured with these sandboxes, and look at the tools and services that come prepackaged with the sandbox. These are the most common ones and make a useful list for starters. Start learning about them (but skip Pig if you can, even if it is pre-installed ;)
Once you are familiar with the sandbox you have, I would suggest moving on to Apache NiFi, which has an easier learning curve and gives a lot of flexibility. You will most likely have to set up a new sandbox for that, but it may also serve as a good revision exercise. Integrate it with your Hadoop sandbox, implement some decent use cases, and you will have some good experience to show.
I have a test for my Big Data class where I have to do some sort of big data analytics with 'smaller' datasets. I actually have my stuff figured out. I installed Hadoop 2.8.1 and Spark 2.2.0 (I use PySpark to build a program) in standalone mode on my Ubuntu 16.04 from source. I'm good to go to do my thing on my own.
The thing is, some of my friends are struggling in configuring all of these and I thought to myself "why don't I make my own little cluster with my classmates". So I'm looking for suggestions.
My laptop has 12 GB RAM and Intel Core i5.
If I understand correctly, your friends have trouble setting up Spark in standalone mode (meaning no cluster at all, just local computation). I don't think setting up a cluster for them to work with would reduce the complexity they face. Or are they trying to set up a cluster? Because standalone mode of Spark really doesn't need much configuration.
Another approach is to use a preconfigured VM everyone can use individually. Either prepared by yourself, or there are sandboxes by different providers, e.g. Cloudera and Hortonworks.
I am actively looking for a way to contribute to Hadoop development. I want to run a user-defined scheduling algorithm on a Hadoop cluster and measure its performance. I have already configured a Hadoop cluster and have gone through the YARN Scheduler Load Simulator (SLS). Now, to run other user-defined algorithms and compare their performance with the regular schedulers (Fair/Capacity), what steps do I need to follow? Which would be better, the simulator or a regular cluster? I am new to Hadoop and badly need help with this.
I have a question regarding the speed and performance of using multiple virtualized nodes on a single machine versus a single node on the machine itself. Which one will perform better?
The reason I ask is that I am currently learning Hadoop on a single machine, and I see some tutorials on the internet that show the use of multiple virtualized nodes on a single machine.
Thank you in advance
There is always some overhead that comes with virtualization, so unless it is really necessary I wouldn't advise running Hadoop in a virtualized environment.
That being said, I know VMware did a lot of work on making Hadoop work in a virtualized environment, and they have published some benchmarks in which they claim that, under certain conditions, VMs perform better than a native installation. I haven't played much with vSphere, but this could be something to look at if you want to explore virtualization further. But don't take the numbers for granted; it really depends on the type of hardware you're running. In some conditions I think you might gain some performance with VMs, but I'm guessing from experience that in most cases you won't gain anything.
If you're just getting started and testing with Hadoop, I think virtualizing is overkill. You can very easily run Hadoop in pseudo-distributed mode, which means that you can run multiple Hadoop daemons on the same box, each as a separate process. That's what I used to get started with Hadoop, and it's a good head start. You can find more info here (or might need another page depending on which Hadoop version you're running).
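As a rough illustration of how little configuration pseudo-distributed mode needs, the sketch below uses Python to print the two small XML fragments involved. The property names and values (fs.defaultFS set to hdfs://localhost:9000 and dfs.replication set to 1) are, to the best of my knowledge, the ones used in Apache's single-node setup guide for Hadoop 2.x, so treat them as an assumption and check the guide for your version. The helper function name is made up for this example.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties: dict) -> str:
    """Render name/value pairs in the <configuration> layout of Hadoop *-site.xml files."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed values for a single-box setup: HDFS on localhost, one copy of each block.
core_site = hadoop_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
hdfs_site = hadoop_site_xml({"dfs.replication": "1"})

print(core_site)  # content intended for etc/hadoop/core-site.xml
print(hdfs_site)  # content intended for etc/hadoop/hdfs-site.xml
```

The replication factor is set to 1 because there is only one machine to hold the blocks; the daemons still run as separate processes on that one box.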
If you get to the point where you want to test with a real cluster but don't have the resources, I would advise looking at Amazon Elastic MapReduce: it gives you a cluster on demand and it's pretty cheap. That way you can do more advanced tests. More info here.
The bottom line is that if the purpose is simply testing, you don't really need a virtual cluster.
A performance analysis case study conducted on this topic showed that a virtual Hadoop cluster is only around 4% less efficient compared to its native counterpart: Virtualized hadoop performance case study