How to Practice Hadoop Programming? [closed] - hadoop

I've just started going through Hadoop introduction videos.
How can I practice on my own? Is there a recommended way to install Hadoop locally for practice?

I found that downloading and installing Hadoop, working through examples, making lots of mistakes, and being okay with that worked well for practice.
If by "install on local" you mean "how do I install it on my local machine without using HDFS?", there's an excellent guide here.

If you want to learn about Hadoop and big data, look into bigdatauniversity.com. It's free, and they give instructions on how to install Hadoop locally on a virtual machine and/or in Amazon Web Services. BigDataUniversity provides labs and instructions to help guide your practice. I have found it helpful so far.

Cloudera recently launched an online platform where you can play with Hadoop and its ecosystem as much as you want. Here you go:
cloudera.com/live

I have been training people on Hadoop for 2 years now. Here are my two cents.
For the learning part, I would recommend the following sources (as others have mentioned above):
Yahoo Blog
Hadoop Definitive Guide
HortonWorks Practice Tutorials
And for practicing, traditionally people have been using Hadoop Virtual Machines but this approach has its downsides:
The VMs are huge; for example, the HortonWorks VM is 9.9 GB.
You might have to upgrade your RAM to 8 GB.
Some BIOSes have virtualization disabled, so you might have to change BIOS settings.
Some machines, such as office desktops/laptops, may not allow installations.
My students and I faced these problems too, so we set up a cluster for our students to practice Hadoop, Spark, and related technologies, and we named it CloudxLab.com.

...I liked bigdatauniversity.com and also noted that MapR, Hortonworks, and Cloudera all offer a downloadable environment that you can use to gain familiarity with the Hadoop operating paradigm.
In fact, if you are studying this with an eye toward working with Hadoop at an Enterprise scale, it's a good idea to explore the products that are being deployed at that level.
I've had a little chance now to explore hands-on with MapR's Hadoop environment and can commend it as a good way of looking into the matter.

I would suggest https://developer.yahoo.com/hadoop/tutorial/ for Hadoop self-paced study. It's a very comprehensive, step-by-step guide, from beginner to advanced level.

You can install a virtual machine image that has Hadoop included, but you may encounter some problems with it. I did that first when I started learning Hadoop, and after several problems (IP, internet, different configs) I decided to learn on a Linux install.
You can find a tutorial here:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
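Once the single-node cluster from that tutorial is up, a small HDFS client program is a quick way to confirm everything is wired together. This is only a sketch: the namenode URI below assumes the fs.default.name value used in that tutorial (hdfs://localhost:54310), so adjust it to whatever your core-site.xml actually says.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSmokeTest {
        public static void main(String[] args) throws Exception {
            // The namenode URI is an assumption; match it to your core-site.xml.
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:54310"), new Configuration());
            // List the HDFS root directory to verify the cluster is reachable.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath() + (status.isDirectory() ? " (dir)" : ""));
            }
            fs.close();
        }
    }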

Related

Hadoop and geospatial connector [closed]

I am using Cloudera Hadoop. I want to perform spatial analytics and need to connect to Quantum GIS (QGIS) for geospatial purposes. I need to know how to connect QGIS and Hadoop. Also, is there any way to connect other GIS systems besides ArcGIS?
There are a number of free and open-source offerings you can use to achieve your goals.
From the list of LocationTech projects, I'd note that GeoJinni (formerly Spatial Hadoop), GeoMesa, and GeoTrellis all work with Hadoop or distributed databases like Accumulo or Cassandra.
More generally, since working with Hadoop means using Java, I'd recommend the GeoTools project for processing geo/gis data on the JVM. GeoTools is used as a library for GeoServer to publish geospatial data using open standards. GeoServer and MapServer are two of the open alternatives for Arc server products.
As you are looking for alternatives to Arc desktop products, QGIS and OpenJump are both options.
As a concrete, small example, I've used the GeoTools library to read shapefiles from HDFS for ingest into GeoMesa with no problems. Previously, I looked at serving up GeoTIFFs hosted on HDFS/S3 through GeoServer; a few small changes were necessary to wire that up through the stack, but I was able to do it since all the software involved is open source.
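To give a feel for the GeoTools API mentioned above, here is a minimal sketch that reads features from a shapefile. The file path is a placeholder and the sketch reads from the local filesystem; pulling the file out of HDFS first (or wiring GeoTools to HDFS directly, as described above) is extra plumbing not shown here. Package names follow the classic GeoTools layout; recent releases moved some interfaces to org.geotools.api.

    import java.io.File;
    import org.geotools.data.FileDataStore;
    import org.geotools.data.FileDataStoreFinder;
    import org.geotools.data.simple.SimpleFeatureCollection;
    import org.geotools.data.simple.SimpleFeatureIterator;
    import org.opengis.feature.simple.SimpleFeature;

    public class ShapefileDump {
        public static void main(String[] args) throws Exception {
            File file = new File("/tmp/example.shp");  // placeholder path
            FileDataStore store = FileDataStoreFinder.getDataStore(file);
            SimpleFeatureCollection features = store.getFeatureSource().getFeatures();
            SimpleFeatureIterator it = features.features();
            try {
                while (it.hasNext()) {
                    SimpleFeature feature = it.next();
                    // Print each feature's id and its default geometry (point, line, polygon, ...).
                    System.out.println(feature.getID() + " -> " + feature.getDefaultGeometry());
                }
            } finally {
                it.close();
                store.dispose();
            }
        }
    }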
(In full-disclosure, I work on GeoMesa, participate quite a bit with LocationTech projects and some with GeoTools/GeoServer.)
GeoWave is a project that matches the criteria.
The goal of the project is primarily to connect the Hadoop ecosystem with popular GIS software, which seems to fit well here. It enables storage/retrieval/analysis of raster/vector/point cloud data within various distributed key-value stores.
Also, the project has installers for Cloudera in addition to other vendors; the latest release supports Cloudera 5.12.1.
Please keep in mind that I am a GeoWave core contributor.

Is Hadoop in Docker container faster/worth it? [closed]

I have a Hadoop-based environment. I use Flume, Hue, and Cassandra in this system. There is a lot of hype around Docker nowadays, so I would like to examine the pros and cons of dockerization in this case. I think it should be much more portable, but the current setup can be done with Cloudera Manager in a few clicks. Is it faster, or why is it worth it? What are the advantages?
Or should only the multi-node Cassandra cluster be dockerized?
Is it faster, or why is it worth it?
It sounds like you already have a Hadoop cluster. So you have to ask yourself, how long does it take to reproduce this environment? How often do you need to reproduce this environment?
If you don't need a way to reproduce the environment repeatedly, or to contain dependencies that may conflict with other applications on the host, then I don't yet see a use case for you.
What are the advantages?
If you are running Hadoop in an environment where you may need mixed Java versions, then running it as a container could isolate the dependencies (in this case, Java) from the host system. In some cases, it would give you a more easily reproducible artifact to move around and set up. But Java apps are already fairly simple to deploy, with all their dependencies included in the JAR.
Should only the multi-node Cassandra cluster be dockerized?
I don't think it really comes down to whether it is a multi-node environment or not. It comes down to the problems it solves. It doesn't sound like you have any pain point in deploying or reproducing Hadoop environments (yet), so I don't see the need to "dockerize" something just because it is the hot new thing on the block.
When you do have the need to reproduce the Hadoop environment easily, you might look at Docker for some of the orchestration and management tools (Kubernetes, Rancher, etc.) which make deploying and managing clusters of applications on an overlay network much more appetizing than just regular Docker. Docker is just the tool in my eyes. It really starts to shine when you can leverage some of the neat overlay multi-host networking, discovery, and orchestration that other packages are building on top of it.

Learning Hadoop for System Admin

This is not a technical question, but I would like suggestions from more experienced people regarding my career.
I have been working as a UNIX admin for the past 13 years, mostly on Solaris with a couple of years on Linux. Now I want to learn something more that can advance my career. I have been hearing a lot about Hadoop/Big Data for quite some time. I do not have any programming or scripting knowledge, nor any knowledge of Apache or databases.
- I am assuming that there are two different job profiles, Developer and Admin. Am I understanding that correctly?
- Do I need to learn Apache, databases, or Java to learn Hadoop (even for the Admin job profile)?
- Where I am, training is expensive. If I want to start studying with books, which book should I start with? I can see the popular ones are "Hadoop: The Definitive Guide" (O'Reilly) and "Big Data for Dummies". (I am asking at a beginner's level.)
Please help with my questions. Your suggestions will help me make a decision.
(Moved from comment because too long.)
In order to administer Hadoop in any meaningful way you need to know a fair bit about (a) how Hadoop works, (b) how Hadoop runs its jobs, and (c) job-specific tuning.
I don't know what "learning Apache" means; Apache is a conglomerate of projects, unless you mean the web server itself.
"Learning databases" is too broad to be useful, and Hadoop isn't a database (HBase is).
You don't need any Java knowledge to administer a Java-based program, although knowing about JVM options, how to specify them, and generalities is certainly helpful.
There is a lot to digest; I would start very small, e.g., with intro books. Also, keep in mind that there are other solutions besides Hadoop, and a lot of different ways to actually use Hadoop.
The Kiji project is a good way to get Hadoop/HBase/etc up and running, though if you're interested in doing everything "from scratch", it's not the best path.

GUI for using Hadoop [closed]

Is there an easy way to use Hadoop other than with the command line?
Which tools are you using and which one is the best?
Hue is pretty cool; new features are regularly pushed out, and it's open source.
From its website:
Hue features a File Browser for HDFS, a Job Designer/Browser for MapReduce, query editors for Hive, Pig, Cloudera Impala and Solr Search.
It also ships with an Oozie Application for creating workflows, various Shells, and a collection of Hadoop APIs.
Although Enrico already answered the question, I would like to add a few points.
Hue is a really amazing tool, and we have been using it at Goibibo.com for the last year. We have exposed it to developers and business people for running their Hive queries and getting results.
We are also indexing log data, so Cloudera Search comes in pretty handy. With the new version of Hue (3.6), you can also run queries on RDBMS data using Hue itself.
I would really recommend using it because it's really simple to use and provides a GUI for almost everything in the big data ecosystem.
If you are on Windows, you can use an open-source project called HDFS Explorer.
If you're on a Mac or Linux, you can mount Hadoop filesystems directly using FUSE, and then use Finder, or Nautilus, or whatever you normally use for filesystem navigation. Check out the Hadoop wiki for how to set up the mounts: http://wiki.apache.org/hadoop/MountableHDFS
Each distribution provides a web-based GUI, in some cases Hue, and in others one based on the new Ambari Views framework, which provides access to file functionality.
You can look at data integration tools like Talend, CloverETL, or Pentaho, which all support Hadoop; Talend's support is especially broad. I don't have much information about the other tools' support for Hadoop.
If you are just looking for something a step up from the CLI for exploring, my installation came with a web server that was installed with Hadoop and is accessible at :50075. The port is configurable, but give that a try.
If you are using CDH, then Hue (Hadoop User Experience) comes with it, and it's a very good user interface for Hadoop. You can also install it separately. It supports all components of Hadoop.
HFX is a lightweight Hadoop file manager you can use. It has some essential features like drag and drop, upload, cut, copy, and paste.

How to start learning hadoop [closed]

I am a web developer. I have experience in web technologies like JavaScript, jQuery, PHP, and HTML, and I know basic concepts of C. Recently I became interested in learning more about MapReduce and Hadoop, so I enrolled in a parallel data processing with MapReduce course at my university. Since I don't have any prior programming knowledge in object-oriented languages like Java or C++, how should I go about learning MapReduce and Hadoop? I have started reading the Yahoo Hadoop tutorials and O'Reilly's Hadoop: The Definitive Guide, 2nd Edition.
I would like you to suggest ways I could go about learning MapReduce and Hadoop.
Here are some nice YouTube videos on MapReduce
http://www.youtube.com/watch?v=yjPBkvYh-ss
http://www.youtube.com/watch?v=-vD6PUdf3Js
http://www.youtube.com/watch?v=5Eib_H_zCEY
http://www.youtube.com/watch?v=1ZDybXl212Q
http://www.youtube.com/watch?v=BT-piFBP4fE
Also, here are nice tutorials on how to set up Hadoop on Ubuntu:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
You can access Hadoop from many different languages, and a number of services will set up Hadoop for you. You could try Amazon's Elastic MapReduce (EMR), for instance, without having to go through the hassle of configuring the servers, workers, etc. This is a good way to get your head around MapReduce processing while delaying the issues of learning how to use HDFS well, how to manage your scheduler, and so on.
It's not hard to search for your favorite language and find Hadoop APIs for it, or at least some tutorials on linking it with Hadoop. For instance, here's a walkthrough of a PHP app run on Hadoop: http://www.lunchpauze.com/2007/10/writing-hadoop-mapreduce-program-in-php.html
Answer 1:
It is very desirable to know Java. Hadoop is written in Java, and its popular SequenceFile format depends on Java.
Even if you use Hive or Pig, you'll probably need to write your own UDF someday. Some people still try to write them in other languages, but Java has more robust and primary support for them.
Most Hadoop tools are not mature enough (like Sqoop, HCatalog, and so on), so you'll see many Java error stack traces, and you'll probably want to hack the source code someday.
Answer 2:
It is not required for you to know Java.
As the others said, it would be very helpful depending on how complex your processing may be. However, there is an incredible amount you can do with just Pig and, say, Hive.
I would agree that it is fairly likely you will eventually need to write a user-defined function (UDF); however, I've written those in Python, and it is very easy to write UDFs in Python.
Granted, if you have very stringent performance requirements, then a Java based MapReduce program would be the way to go. However, great advancements in performance are being made all of the time in both Pig and Hive.
So, the short answer to your question is, "No", it is not required for you to know Java in order to perform Hadoop development.
Source:
http://www.linkedin.com/groups/Is-it-must-Hadoop-Developer-988957.S.141072851
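Since both answers mention Hive UDFs, here is the classic lower-casing UDF as a minimal Java sketch. It uses the older org.apache.hadoop.hive.ql.exec.UDF API (newer Hive versions favour GenericUDF), and the class and function names are just illustrative.

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // Registered in Hive with something like (jar name is a placeholder):
    //   ADD JAR my-udfs.jar;
    //   CREATE TEMPORARY FUNCTION my_lower AS 'Lower';
    public final class Lower extends UDF {
        // Hive finds evaluate() by reflection; returning null for null input
        // keeps the UDF well-behaved on missing data.
        public Text evaluate(final Text s) {
            if (s == null) {
                return null;
            }
            return new Text(s.toString().toLowerCase());
        }
    }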
1) Learn Java. No way around that, sorry.
2) Profit! It'll be very easy after that -- Hadoop is pretty darn simple.
It sounds like you are on the right track. I recommend setting up some virtual machines on your home computer to start taking what you see in the books and implementing it in your VMs. As with many things, the only way to become better at something is to practice it. Once you get into it, I am sure you will have enough knowledge to start a small project to implement with Hadoop. Here are some examples of things people have built with Hadoop: Powered by Hadoop
Go through the Yahoo Hadoop tutorial before going through Hadoop: The Definitive Guide. The Yahoo tutorial gives you a very clean and easy understanding of the architecture.
I think the concepts are not arranged well in the book, which makes it a little difficult to study.
So do not study them together; go through the web tutorial first.
I just put together a paper on this topic. Great resources above, but I think you'll find some additional pointers here: http://images.globalknowledge.com/wwwimages/whitepaperpdf/WP_CL_Learning_Hadoop.pdf
Feel free to follow my blog about Big Data: https://oyermolenko.blog. I've been working with Hadoop for a couple of years, and in this blog I want to share my experience from the very start. I came from a .NET environment and faced a couple of challenges related to switching from one language to another. My blog is aimed at people who haven't worked with Hadoop but have some basic technical background, like you. Step by step, I want to cover the whole family of Big Data services, describe the concepts, and go over the common problems I met working with them. I hope you will enjoy it.
