What are some good resources for studying Hadoop's source code? - hadoop

Are there any good resources that would help me study Hadoop's source code? I'm particularly looking for university courses or research papers.

Studying Hadoop or MapReduce can be a daunting task if you try to get your hands dirty right at the start. I followed this schedule:
Start with the very basics of MR with
code.google.com/edu/parallel/dsd-tutorial.html
code.google.com/edu/parallel/mapreduce-tutorial.html
Then go for the first two lectures in
www.cs.washington.edu/education/courses/cse490h/08au/lectures.htm
A very good course intro to MapReduce and Hadoop.
Read the seminal paper
http://research.google.com/archive/mapreduce.html and its improvements in the updated version
http://www.cs.washington.edu/education/courses/cse490h/08au/readings/communications200801-dl.pdf
Then go for all the other videos in the U.Washington link given above.
Try searching YouTube for the terms MapReduce and Hadoop to find videos by O'Reilly and Google RoundTable for a good overview of the future of Hadoop and MapReduce.
Then off to the most important videos -
Cloudera Videos
www.cloudera.com/resources/?media=Video
and
Google MiniLecture Series
code.google.com/edu/submissions/mapreduce-minilecture/listing.html
Along with all the multimedia above, we need good written material. Documents:
Architecture diagrams at hadooper.blogspot.com are good to have on your wall
Hadoop: The Definitive Guide goes more into the nuts and bolts of the whole system, whereas
Hadoop in Action is a good read with lots of teaching examples to learn the concepts of Hadoop.
Pro Hadoop is not for beginners.
The PDFs of the documentation from the Apache Foundation
hadoop.apache.org/common/docs/current/ and
hadoop.apache.org/common/docs/stable/
will help you learn how to model your problem as an MR solution so that you get the full advantages of Hadoop.
The HDFS paper by Yahoo! Research is also a good read for gaining in-depth knowledge of Hadoop.
Subscribe to the user mailing lists of Common, MapReduce and HDFS to learn about problems, solutions and upcoming solutions.
Try the tutorial at http://developer.yahoo.com/hadoop/tutorial/module1.html for a beginner-to-expert path through Hadoop.
For Any Queries ... Contact Apache, Google, Bing, Yahoo!
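
To make the advice above about modelling a problem as an MR solution a bit more concrete, here is a minimal sketch of the map, shuffle and reduce phases in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in a single process, so treat it only as a way to see the data flow before working through the tutorials.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop map reduce", "map reduce map"]))
```

The point of the exercise is that the mapper and reducer only ever see one record or one key at a time, which is what lets a real framework spread them across machines.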

Your question seems overly broad. To find a resource to use while looking at the source code, you should narrow down what you want to study. This will make it easier for you (and anyone on SO) to find papers/topics covering that area.
I've dug into the Hadoop source a few times, normally with a very specific class I needed to learn about. In these cases an external resource wasn't really needed, and since I had the class name, I just googled for that and found resources.
If I were to start trying to understand the hadoop source at a higher level I'd get the source code and my copy of Hadoop: The Definitive Guide and use that as a reference to understand the higher level connections of the source code.
I won't claim that this would be a perfect solution. H:TDG is at a more technical level than the other hadoop books I have and I find it to be very informative.
H:TDG is what I'd start with and as I found areas I wanted to dig into more, I would start searching for those specifically.

Related

What is the best way to learn about the Hadoop ecosystem

I'm a Data Scientist with a background in pure mathematics, so I have a bit of a learning curve in terms of tools. After working in the industry for about a year, I understand that a Data Scientist should also know some Data Engineering. Can anyone point me to some resources? My current tech stack consists mostly of Python (PySpark), etc.
It depends on what exactly you want to learn about the Hadoop ecosystem.
I would recommend starting with this book:
Hadoop: The Definitive Guide. It can help you understand how Hadoop works under the hood and what the Hadoop ecosystem consists of. You don't need every chapter of this book, but many of them may be really useful.
You should probably also check out this book:
Spark: The Definitive Guide, since Spark is commonly used in the Data Science area. It's a more practical book than the previous one.

standard on hadoop coding

Can I get a reference to any document that explains the standards for different Hadoop applications, i.e. Hive, HBase, Pig, Sqoop, Oozie? By standard I mean the standards / best practices that should be followed during coding, etc.
E.g. one standard I know is that in Hadoop we shouldn't go for a large number of small files; rather, we should go for a small number of big files (i.e. by avoiding unnecessary partitions in Hive tables).
I am looking for standards like this in other areas.
If you mean "coding style" and general coding practices for code to be included in Hadoop itself, then https://wiki.apache.org/hadoop/CodeReviewChecklist is the first thing that pops up when googling for "hadoop coding style".
If you mean anything else, then it's clearly too broad a question to be answered here.
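
On the small-files example from the question, a rough back-of-the-envelope sketch may help show why it matters. It assumes a 128 MiB HDFS block size and the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per file or block object. Both numbers are assumptions for illustration (the block size is configurable and the per-object cost varies), and the function names are hypothetical, so read the result as an order-of-magnitude comparison only.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # assumed HDFS block size (128 MiB); configurable in practice
BYTES_PER_OBJECT = 150       # assumed rule-of-thumb NameNode heap cost per file/block object


def namenode_objects(num_files, file_size):
    """Count one file object plus one object per block, for num_files files of file_size bytes."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return num_files * (1 + blocks_per_file)


def estimated_heap_bytes(num_files, file_size):
    """Rough NameNode heap estimate under the assumptions above."""
    return namenode_objects(num_files, file_size) * BYTES_PER_OBJECT


if __name__ == "__main__":
    total = 1024**4  # the same 1 TiB of data, stored two different ways
    small = estimated_heap_bytes(total // 1024**2, 1024**2)  # as 1 MiB files
    large = estimated_heap_bytes(total // 1024**3, 1024**3)  # as 1 GiB files
    print(f"1 MiB files: ~{small / 1024**2:.1f} MiB of NameNode heap")
    print(f"1 GiB files: ~{large / 1024**2:.1f} MiB of NameNode heap")
```

Under these assumptions the same data costs a couple of hundred times more NameNode metadata when stored as many small files, which is the intuition behind the "few big files" guideline.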

Tutorial on performance analysis of pig and hive scripts

I am looking for good tutorials on doing performance analysis and improvement of pig latin scripts and hive scripts.
I'm not aware of any such tutorial. The only good way in my view is to do it yourself keeping your data and your case in mind.
Having said that, you can use something like TPC-H to benchmark your queries and, based on the results, improve and optimize your Pig and Hive queries wherever you find performance bottlenecks. This will also help you figure out what Pig and Hive are not good at. You can also compare the two tools if you are unsure which one to use for a particular task.
You can find more on this at the links below:
Running TPC-H Benchmark on Pig Ticket.
And if you need all the details, you can read the original papers on running TPC-H on Pig and Hive. These papers contain a great deal of information, and you will definitely find them helpful during the process.
HTH
I'm not sure if it's what you're looking for, but Big Data University has some pretty good tutorials on Hive and Pig. Give it a shot. You'll need the IBM QuickStart VM. It's a huge download, but it's free and pretty good.
Link:
http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/
There are tutorials on the VM as well that are pretty good, but I think the ones at BigDataUni are better.
In case it matters, I registered on both websites and haven't gotten any spam or anything.

Is it worth purchasing Mahout in Action to get up to speed with Mahout, or are there other better sources?

I'm currently a very casual user of Apache Mahout, and I'm considering purchasing the book Mahout in Action. Unfortunately, I'm having a really hard time getting an idea of how worth it this book is -- and seeing as it's a Manning Early Access Program book (and therefore only currently available as a beta-version e-book), I can't take a look myself in a bookstore.
Can anyone recommend this as a good (or less good) guide to getting up to speed with Mahout, and/or other sources that can supplement the Mahout website?
Speaking as a Mahout committer and co-author of the book, I think it is worth it. ;-)
But seriously, what are you working on? Maybe we can point you to some resources.
Some aspects of Mahout are just plain hard to figure out on your own. We work hard at answering questions on the mailing list, but it can really help to have sample code and a roadmap. Without some of that, it is hard to even ask a good question.
Also a co-author here. Being "from the horse's mouth", it's probably by far the most complete write-up out there for Mahout itself. There are some good blog posts out there, and certainly plenty of good books on machine learning more generally (I like Collective Intelligence in Action as a broad, light intro). user@mahout.apache.org has a few people who say they like the book, FWIW, as do the book forums (http://www.manning-sandbox.com/forum.jspa?forumID=623). I think you can return the e-book if it's not quite what you wanted. It definitely has 6 chapters on clustering.
There are many parts of the book that are out of date, a version or two behind what is current. In addition, there are several mistakes within the text, particularly within the examples. This may make things a bit tricky when trying to replicate the discussed results.
Additionally, you should be aware that the most mature part of Mahout, the recommender system Taste, isn't distributed. I'm not really sure why this is packaged with the rest of Mahout. This is more a complaint about the software package than Mahout itself.
Currently the best out there. Probably as mature as the product. Some aspects are better than others: insight into the underlying implementation is good; practical methods to get up and running on Linux, Mac OS X, etc. for beginners, not so much. Defining a clear strategy for keeping a recommender updated is iffy. Production examples are pretty thin. Good as a starting point, but you need a lot more. The authors make their best attempt to help, but it is a pretty new product. All in all, yes, buy it.
I got the book a few weeks ago. Highly recommended. The authors are very active on the mailing list, too, and there is a lot of cool energy in this project.
You might also consider reading through Paco Nathan's Enterprise Data Workflows in Cascading. You can run PMML on your cluster exported from R or SAS. That is not to say anything bad about Mahout in Action, the authors did a great job and clearly put good time and effort into making it instructive and interesting. This is more of a suggestion to look beyond Mahout. It's not currently getting the kind of traction it would if it were more user friendly.
As it stands, the Mahout user experience is kinda choppy and doesn't really give you a clear idea of how to develop and update intelligent systems and their life cycles, IMO. Mahout is not really acceptable for academics either; they are more likely to use Matlab or R. In the Mahout docs, the random forest implementation barely works and the docs have erroneous examples, etc. That's frustrating, and the parallelism and scalability of the Mahout routines depend on the algorithm. I don't currently see Mahout going anywhere solid as it stands, again IMO. I hope I'm wrong!
http://shop.oreilly.com/product/0636920028536.do

Where do I start with distributed computing?

I'm interested in learning techniques for distributed computing. As a Java developer, I'm probably willing to start with Hadoop. Could you please recommend some books/tutorials/articles to begin with?
Maybe you can read some papers related to MapReduce and distributed computing first, to gain a better understanding of it. Here are some I would recommend:
MapReduce: Simplified Data Processing on Large Clusters, http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean_html/
Bigtable: A Distributed Storage System for Structured Data, http://www.usenix.org/events/osdi06/tech/chang/chang_html/
Dryad: Distributed data-parallel programs from sequential building blocks, http://pdos.csail.mit.edu/6.824-2007/papers/isard-dryad.pdf
The landscape of parallel computing research: A view from berkeley, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.8705&rep=rep1&type=pdf
On the other hand, if you want to get to know Hadoop better, you could start reading the source code of the Hadoop MapReduce framework.
Currently, bookwise I would check out Hadoop: The Definitive Guide. It's written by Tom White, who has worked on Hadoop for a good while now and works at Cloudera with Doug Cutting (Hadoop's creator).
Also, on the free side, Jimmy Lin from UMD has written a book called Data-Intensive Text Processing with MapReduce. Here's a link to the final pre-production version (link provided by the author on his website).
Hadoop is not necessarily the best tool for all distributed computing problems. Despite its power, it also has a pretty steep learning curve and cost of ownership.
You might want to clarify your requirements and look for suitable alternatives in the Java world, such as HTCondor, JPPF or GridGain (my apologies to those I do not mention).
Here are some resources from Yahoo! Developer Network
a tutorial:
http://developer.yahoo.com/hadoop/tutorial/
an introductory course (requires Silverlight, sigh):
http://yahoo.hosted.panopto.com/CourseCast/Viewer/Default.aspx?id=281cbf37-eed1-4715-b158-0474520014e6
The All Things Hadoop Podcast http://allthingshadoop.com/podcast has some good content and good guests. A lot of it is geared to getting started with Distributed Computing.
MIT 6.824 is the best material. Reading only the Google papers related to Hadoop is not enough; a systematic course is required if you want to go deeper.
If you are looking to learn a distributed computing platform that is less complicated than Hadoop you can try Zillabyte. You only need to know some Ruby or Python to build apps on the platform.
As LoLo said, Hadoop is a powerful solution, but can be rough to start with.
For materials to learn about distributed computing try http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-824-distributed-computer-systems-engineering-spring-2006/syllabus/. There are several resources recommended by the course as well.
