Is there an easy way to use Hadoop other than with the command line?
Which tools are you using and which one is the best?
Hue is pretty cool: new features are regularly pushed out, and it's open source.
From its website:
Hue features a File Browser for HDFS, a Job Designer/Browser for MapReduce, query editors for Hive, Pig, Cloudera Impala and Solr Search.
It also ships with an Oozie Application for creating workflows, various shells, and a collection of Hadoop APIs.
Although Enrico has already answered the question, I would like to add a few points.
Hue is a really amazing tool, and we have been using it at Goibibo.com for the last year. We have exposed it to developers and business people for running their Hive queries and getting results.
We are also indexing log data, so Cloudera Search comes in pretty handy. With the new version of Hue (3.6), you can also run queries on RDBMS data using Hue itself.
I would really recommend using it, because it's really simple to use and provides a GUI for almost everything in the big data ecosystem.
If you are on Windows, you can use an open-source project called HDFS Explorer.
If you're on a Mac or Linux, you can mount Hadoop filesystems directly using FUSE, and then use Finder, Nautilus, or whatever you normally use for filesystem navigation. Check out the Hadoop wiki for how to set up the mounts: http://wiki.apache.org/hadoop/MountableHDFS
Each distribution provides a web-based GUI (in some cases Hue, in others built on the new Ambari Views framework) that provides access to file functionality.
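If you'd rather script against HDFS than click through a GUI, the Java FileSystem API is the standard programmatic route. Here is a minimal sketch (the NameNode address hdfs://namenode:8020 and the /user path are placeholders; take the real values from your cluster's core-site.xml) that lists a directory much as Hue's File Browser would:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsLs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; adjust to your cluster.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            try (FileSystem fs = FileSystem.get(conf)) {
                // List a directory, much like a file browser would.
                for (FileStatus status : fs.listStatus(new Path("/user"))) {
                    System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
                }
            }
        }
    }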
You can also look at data integration tools like Talend, CloverETL, or Pentaho, which provide support for Hadoop. Talend's support is particularly broad; I don't have much information about the other tools' Hadoop support.
If you are just looking for something a step up from the CLI for exploring, my installation includes a web server that was installed with Hadoop and is accessible at port 50075 (the default DataNode web UI). The port is configurable, but give that a try.
If you are using CDH, then Hue (Hadoop User Experience) comes with it, and it's a very good user interface for Hadoop. You can also install it separately. It supports all the major Hadoop components.
HFX is a lightweight Hadoop file manager you can use. It has some essential features like drag and drop, upload, cut, copy, and paste.
I want to install Hadoop, Pig, and Hive on my laptop. I don't know how to install and configure them or what software is required to do it.
Please let me know the exact steps required to install and configure Hadoop, Pig, and Hive on a laptop.
Also, I am on Windows; can I install Hadoop on Windows?
For beginners, I would recommend sticking to a good prepackaged Hadoop distribution/sandbox. Even if you want to learn how to set up a Hadoop cluster before using the tools it provides (e.g. Hive), setting up a common distribution is a lot easier, at least in the beginning.
Prepackaged Hadoop sandboxes are going to be Linux-based, but most likely you will not need to do much in Linux to start using Hadoop if you start from these sandboxes. Personally, I think the time you will save by avoiding support and documentation issues with Windows ports will more than compensate for any added effort required to jump into Linux, and you will at least enter the domain of Linux, which is itself a tremendously important tool.
For prepackaged solutions, you may aim for the Cloudera quickstart VM or the MapR quickstart VM, as these are the most widely used distributions. By using sandboxes, you skip the installation process (which can be hectic if you don't know what you want, especially if you aren't familiar with Linux) and jump right into using the tools. Thanks to the good documentation available from large vendors such as Cloudera and MapR, you will also face fewer issues in accessing the tools you want to learn.
Follow the vendor-specific setup guidelines (also listed on the download pages as getting-started guides) for further details on setting up the sandbox.
Once you have the sandbox set up, you can access Hive and Pig in a lot of different ways. You can use Hive's command-line interface, Beeline. If you are familiar with JDBC, you can access Hive through that as well. Installing Apache Thrift enables much wider access options, but you can also save that for later.
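As a rough sketch of the JDBC route (assuming HiveServer2 is running on the sandbox at the default port 10000; the hostname and credentials are placeholders for your setup, and the hive-jdbc driver must be on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // Load the HiveServer2 JDBC driver (ships with hive-jdbc).
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // "sandbox" is a placeholder hostname for your quickstart VM.
            String url = "jdbc:hive2://sandbox:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }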
I would not recommend learning Pig unless you have a very specific use for it. If you are familiar with Java (or Scala, or even Python, among other options), try writing some MapReduce-style jobs to learn more about how Hadoop works. Open the Ambari (or Cloudera Manager, etc.) interface, which comes preconfigured with these sandboxes, and see the tools and services that come prepackaged with the sandbox. These are the most common ones and can be used as a useful list for starters. Start learning about them (but skip Pig if you can, even if it is pre-installed ;)
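For instance, word count is the canonical first MapReduce exercise; this is essentially the example that ships with Hadoop, trimmed down:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Mapper: emit (word, 1) for every token in the input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the counts emitted for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }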
Once you are familiar with your sandbox, I would suggest going for Apache NiFi, which has an easier learning curve and gives a lot of flexibility. You will most likely have to set up a new sandbox for that, but it may also serve as a good revision exercise. Integrate it with your Hadoop sandbox, implement some decent use cases, and you will have some good experience to show.
I am using Cloudera Hadoop. I want to perform spatial analytics and need to connect to Quantum GIS (QGIS) for geospatial purposes. I need to know how to connect QGIS and Hadoop. Also, is there any way to connect other GIS systems besides ArcGIS?
There are a number of free and open-source offerings you can use to achieve your goals.
From the list of LocationTech projects, I'd note that GeoJinni (formerly SpatialHadoop), GeoMesa, and GeoTrellis all work with Hadoop or with distributed databases like Accumulo or Cassandra.
More generally, since working with Hadoop means using Java, I'd recommend the GeoTools project for processing geo/GIS data on the JVM. GeoTools is used as a library by GeoServer to publish geospatial data using open standards. GeoServer and MapServer are two of the open alternatives to Arc server products.
As you are looking for alternatives to Arc desktop products, QGIS and OpenJump are both options.
As a concrete, small example, I've used the GeoTools library to read shapefiles from HDFS for ingest into GeoMesa with no problems. Previously, I looked at serving up GeoTIFFs hosted on HDFS/S3 through GeoServer; a few small changes were necessary to wire that up through the stack, but I was able to do it since all the software involved is open source.
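To give a flavor of that, here is a minimal sketch of reading a shapefile with GeoTools; the file path is a placeholder, and reading straight from HDFS additionally requires wiring Hadoop's filesystem into the URL handling, which isn't shown here:

    import java.net.URL;

    import org.geotools.data.shapefile.ShapefileDataStore;
    import org.geotools.data.simple.SimpleFeatureCollection;
    import org.geotools.data.simple.SimpleFeatureIterator;
    import org.opengis.feature.simple.SimpleFeature;

    public class ShapefileRead {
        public static void main(String[] args) throws Exception {
            // Placeholder path; a local file keeps the sketch simple.
            ShapefileDataStore store =
                    new ShapefileDataStore(new URL("file:///data/example.shp"));
            try {
                SimpleFeatureCollection features = store.getFeatureSource().getFeatures();
                SimpleFeatureIterator it = features.features();
                try {
                    // Walk the features and print each ID and geometry.
                    while (it.hasNext()) {
                        SimpleFeature feature = it.next();
                        System.out.println(feature.getID() + ": " + feature.getDefaultGeometry());
                    }
                } finally {
                    it.close();
                }
            } finally {
                store.dispose();
            }
        }
    }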
(In full disclosure, I work on GeoMesa and participate quite a bit in LocationTech projects, and some in GeoTools/GeoServer.)
GeoWave is a project that matches the criteria.
The goal of the project is primarily to connect the Hadoop ecosystem with popular GIS software, which seems to fit well here. It enables storage, retrieval, and analysis of raster/vector/point-cloud data within various distributed key-value stores.
Also, the project has installers for Cloudera in addition to other vendors; the latest release supports Cloudera 5.12.1.
Please keep in mind that I am a GeoWave core contributor.
I have a Hadoop-based environment. I use Flume, Hue, and Cassandra in this system. There is a big hype around Docker nowadays, so I would like to examine the pros and cons of dockerization in this case. I think it should be much more portable, but the system can already be set up with Cloudera Manager in a few clicks. Is it faster, or why is it worth it? What are the advantages?
Or should maybe only the multi-node Cassandra cluster be dockerized?
Is it faster, or why is it worth it?
It sounds like you already have a Hadoop cluster. So you have to ask yourself, how long does it take to reproduce this environment? How often do you need to reproduce this environment?
If you don't need a way to reproduce the environment repeatedly, or to contain dependencies that may conflict with other applications on the host, then I don't yet see a use case for you.
What are the advantages?
If you are running Hadoop in an environment where you may need mixed Java versions, then running it as a container could isolate the dependencies (in this case, Java) from the host system. In some cases, it would get you a more easily reproducible artifact to move around and set up. But Java apps are already simple to move, with all their dependencies included in the JAR.
Should maybe only the multi-node Cassandra cluster be dockerized?
I don't think it really comes down to whether it is a multi-node environment or not; it comes down to the problems it solves. It doesn't sound like you have any pain points in deploying or reproducing Hadoop environments (yet), so I don't see the need to "dockerize" something just because it is the hot new thing on the block.
When you do have the need to reproduce a Hadoop environment easily, you might look at Docker together with the orchestration and management tools built on it (Kubernetes, Rancher, etc.), which make deploying and managing clusters of applications on an overlay network much more appetizing than regular Docker alone. Docker is just the tool, in my eyes. It really starts to shine when you can leverage the multi-host overlay networking, discovery, and orchestration that other packages are building on top of it.
I've just started going through Hadoop introduction videos.
How do I practice it on my own? Is there a recommended way to install it locally for practice?
I found that downloading and installing Hadoop, playing with it by working through examples, making lots of mistakes, and being OK with that worked well for practice.
By "install on local" if you're saying "how do I install it on my local machine without using HDFS?", there's an excellent guide here.
If you want to learn about Hadoop and big data, look into bigdatauniversity.com. It's free, and they give instructions on how to install Hadoop locally on a virtual machine and/or in Amazon Web Services. Big Data University provides labs and instructions to help guide your practice. I have found it helpful so far.
Recently, Cloudera launched a new online platform where you can play with Hadoop and its ecosystem as much as you want. Here you go:
cloudera.com/live
I have been training people on Hadoop for 2 years now. Here are my two cents.
For the learning part, I would recommend the following sources (as others have mentioned above):
Yahoo Blog
Hadoop Definitive Guide
HortonWorks Practice Tutorials
And for practicing, people have traditionally used Hadoop virtual machines, but this approach has its downsides:
The VMs are huge; for example, Hortonworks' VM is 9.9 GB.
You might have to upgrade your RAM to 8 GB.
Some BIOSes have virtualization disabled; you might have to change BIOS settings.
Some machines, such as office desktops/laptops, may not allow installations.
My students and I faced these problems too. So we set up a cluster for our students to practice Hadoop, Spark, and related technologies, and we named it CloudxLab.com.
I liked bigdatauniversity.com, and I also noted that MapR, Hortonworks, and Cloudera all offer a downloadable environment that you can use to gain familiarity with the Hadoop operating paradigm.
In fact, if you are studying this with an eye toward working with Hadoop at enterprise scale, it's a good idea to explore the products that are being deployed at that level.
I've now had a little chance to explore MapR's Hadoop environment hands-on and can commend it as a good way of looking into the matter.
I would suggest https://developer.yahoo.com/hadoop/tutorial/ for Hadoop self-paced study. It's a very comprehensive, step-by-step guide, from beginner to advanced level.
You can install a virtual machine image that has Hadoop included, but you may encounter some problems with it. I did that first when I started learning Hadoop, and after several problems (IP, internet, different configs) I decided to learn on a native Linux install.
You can find a tutorial here:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
I am a web developer. I have experience with web technologies like JavaScript, jQuery, PHP, and HTML, and I know the basic concepts of C. Recently I have taken an interest in learning more about MapReduce and Hadoop, so I enrolled myself in a parallel data processing with MapReduce course at my university. Since I don't have any prior programming knowledge of object-oriented languages like Java or C++, how should I go about learning MapReduce and Hadoop? I have started to read the Yahoo Hadoop tutorials and also O'Reilly's Hadoop: The Definitive Guide, 2nd edition.
I would like you guys to suggest ways I could go about learning MapReduce and Hadoop.
Here are some nice YouTube videos on MapReduce:
http://www.youtube.com/watch?v=yjPBkvYh-ss
http://www.youtube.com/watch?v=-vD6PUdf3Js
http://www.youtube.com/watch?v=5Eib_H_zCEY
http://www.youtube.com/watch?v=1ZDybXl212Q
http://www.youtube.com/watch?v=BT-piFBP4fE
Also, here are nice tutorials on how to set up Hadoop on Ubuntu:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
You can access Hadoop from many different languages, and a number of services will set up Hadoop for you. You could try Amazon's Elastic MapReduce (EMR), for instance, without having to go through the hassle of configuring the servers, workers, etc. This is a good way to get your head around MapReduce processing while deferring the issues of learning how to use HDFS well, how to manage your scheduler, and so on.
It's not hard to search for your favorite language and find Hadoop APIs for it, or at least some tutorials on linking it with Hadoop. For instance, here's a walkthrough of a PHP app run on Hadoop: http://www.lunchpauze.com/2007/10/writing-hadoop-mapreduce-program-in-php.html
Answer 1:
It is very desirable to know Java. Hadoop is written in Java, and its popular SequenceFile format depends on Java.
Even if you use Hive or Pig, you'll probably need to write your own UDF someday. Some people try to write them in other languages, but I'd say Java has the most robust, first-class support for them (a minimal Java UDF sketch follows below, after the source link).
Many Hadoop tools are not mature enough (like Sqoop, HCatalog, and so on), so you'll see many Java error stack traces, and you'll probably want to hack the source code someday.
Answer 2:
It is not required for you to know Java.
As the others said, it would be very helpful, depending on how complex your processing may be. However, there is an incredible amount you can do with just Pig and, say, Hive.
I would agree that it is fairly likely you will eventually need to write a user-defined function (UDF); however, I've written those in Python, and it is very easy to write UDFs in Python.
Granted, if you have very stringent performance requirements, then a Java-based MapReduce program would be the way to go. However, great advances in performance are being made all the time in both Pig and Hive.
So, the short answer to your question is, "No", it is not required for you to know Java in order to perform Hadoop development.
Source:
http://www.linkedin.com/groups/Is-it-must-Hadoop-Developer-988957.S.141072851
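To give a taste of the Java UDF route from Answer 1, here is a minimal sketch of a Hive UDF using the classic org.apache.hadoop.hive.ql.exec.UDF base class (newer Hive versions prefer GenericUDF, but this is the simplest starting point; the function and JAR names here are made up for illustration):

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // A trivial Hive UDF that upper-cases its string argument.
    // After building a JAR, register and use it from Hive:
    //   ADD JAR my-udfs.jar;
    //   CREATE TEMPORARY FUNCTION my_upper AS 'MyUpper';
    //   SELECT my_upper(name) FROM users;
    public class MyUpper extends UDF {
        public Text evaluate(Text input) {
            if (input == null) {
                return null; // Hive passes NULLs through.
            }
            return new Text(input.toString().toUpperCase());
        }
    }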
1) Learn Java. No way around that, sorry.
2) Profit! It'll be very easy after that -- Hadoop is pretty darn simple.
It sounds like you are on the right track. I recommend setting up some virtual machines on your home computer so you can start taking what you see in the books and implementing it in your VMs. As with many things, the only way to get better at something is to practice it. Once you get into it, I am sure you will have enough knowledge to start a small project implementing Hadoop. Here are some examples of things people have built with Hadoop: Powered by Hadoop
Go through the Yahoo Hadoop tutorial before going through Hadoop: The Definitive Guide. The Yahoo tutorial gives you a very clean and easy understanding of the architecture.
I think the concepts are not arranged well in the book, which makes it a little difficult to study, so do not study both together; go through the web tutorial first.
I just put together a paper on this topic. Great resources above, but I think you'll find some additional pointers here: http://images.globalknowledge.com/wwwimages/whitepaperpdf/WP_CL_Learning_Hadoop.pdf
Feel free to join my blog about Big Data: https://oyermolenko.blog. I've been working with Hadoop for a couple of years, and in this blog I want to share my experience from the early days. I came from a .NET environment and faced a couple of challenges related to switching from one ecosystem to another. My blog is oriented toward people who haven't worked with Hadoop but have some basic technical background, like you. Step by step, I want to cover the whole family of Big Data services, describe the concepts, and discuss the common problems I met while working with them. I hope you will enjoy it.