standard on hadoop coding

Can I get a reference to any document that explains the standards for the different Hadoop applications, i.e. Hive, HBase, Pig, Sqoop, Oozie? By standard I mean the standards / best practices that should be followed during coding, etc.
For example, one standard I know is that in Hadoop we shouldn't go for a large number of small files; rather we should go for a small number of big files (for instance, by avoiding unnecessary partitions in Hive tables).
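For illustration, here is a minimal sketch of one way to act on that standard: merging a directory of many small files into a single big one with Hadoop's FileUtil.copyMerge (the paths are hypothetical; copyMerge exists in Hadoop 1.x/2.x but was removed in 3.x):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergeSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // concatenate every file under the source directory into one target file
            FileUtil.copyMerge(fs, new Path("/data/small-files"),
                               fs, new Path("/data/merged/part-all"),
                               false, conf, null);
        }
    }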
I am looking for standards in other areas like this.

If you mean "coding style" and general coding practices when doing stuff to be included inside Hadoop, then https://wiki.apache.org/hadoop/CodeReviewChecklist pops up the first thing when googling for "hadoop coding style".
If you mean anything else, then it's clearly too broad question to be answered here.

Related

Tutorial on performance analysis of pig and hive scripts

I am looking for good tutorials on doing performance analysis and improvement of pig latin scripts and hive scripts.
I'm not aware of any such tutorial. The only good way, in my view, is to do it yourself, keeping your data and your use case in mind.
Having said that, you can make use of something like TPC-H to benchmark your queries, and based on the results you can improve and optimize your Pig and Hive queries if you find performance bottlenecks. This will also help you figure out what Pig and Hive are not good at. You can also compare the two tools if you are confused about which one to choose for a particular task.
You can find more on this by visiting the links below:
Running TPC-H Benchmark on Pig ticket.
Running TPC-H Benchmark on Hive ticket.
And if you need all the details, you can read the original papers on running TPC-H on Pig and Hive. These papers contain a great deal of information and you will definitely find them helpful during the process.
HTH
I'm not sure if it's what you're looking for, but Big Data University has some pretty good tutorials on Hive and Pig. Give it a shot. You'll need the IBM QuickStart VM. It's a huge download, but it's free and pretty good.
Link:
http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/
There are tutorials on the VM as well that are pretty good, but I think the ones at BigDataUni are better.
In case it matters, I registered on both websites and haven't gotten any spam or anything.

What are some good resources for studying Hadoop's source code?

Are there any good resources that would help me study Hadoop's source code? I'm particularly looking for university courses or research papers.
Studying Hadoop or MapReduce can be a daunting task if you get your hands dirty right at the start. I followed a schedule like this:
Start with very basics of MR with
code.google.com/edu/parallel/dsd-tutorial.html
code.google.com/edu/parallel/mapreduce-tutorial.html
Then go for the first two lectures in
www.cs.washington.edu/education/courses/cse490h/08au/lectures.htm
A very good course intro to MapReduce and Hadoop.
Read the seminal paper
http://research.google.com/archive/mapreduce.html and its improvements in the updated version
http://www.cs.washington.edu/education/courses/cse490h/08au/readings/communications200801-dl.pdf
Then go for all the other videos in the U.Washington link given above.
Try youtubing the terms MapReduce and Hadoop to find videos by O'Reilly and Google RoundTable for a good overview of the future of Hadoop and MapReduce
Then off to the most important videos -
Cloudera Videos
www.cloudera.com/resources/?media=Video
and
Google MiniLecture Series
code.google.com/edu/submissions/mapreduce-minilecture/listing.html
Along with all the multimedia above, we need good written material. Documents:
Architecture diagrams at hadooper.blogspot.com are good to have on your wall
Hadoop: The Definitive Guide goes more into the nuts and bolts of the whole system, whereas
Hadoop in Action is a good read with lots of teaching examples to learn the concepts of hadoop.
Pro Hadoop is not for beginners
PDFs of the documentation from the Apache Foundation
hadoop.apache.org/common/docs/current/ and
hadoop.apache.org/common/docs/stable/
will help you learn how to model your problem as an MR solution in order to gain the full advantages of Hadoop.
The HDFS paper by Yahoo! Research is also a good read for gaining in-depth knowledge of Hadoop
Subscribe to the user mailing lists of Common, MapReduce and HDFS in order to learn about problems, solutions and future directions.
Try the http://developer.yahoo.com/hadoop/tutorial/module1.html link for a beginner-to-expert path through Hadoop
For Any Queries ... Contact Apache, Google, Bing, Yahoo!
Your question seems overly broad - to get a resource to use while looking at source code, you should narrow down what you want to study. This will make it easier for you (and anyone on SO) to find papers/topics covering it.
I've dug into the Hadoop source a few times, normally when there was a very specific class I needed to learn about. In those cases an external resource wasn't really needed; since I had the class name, I just googled for it and found resources.
If I were to start trying to understand the Hadoop source at a higher level, I'd get the source code and my copy of Hadoop: The Definitive Guide and use that as a reference to understand the higher-level connections of the source code.
I won't claim that this would be a perfect solution. H:TDG is at a more technical level than the other hadoop books I have and I find it to be very informative.
H:TDG is what I'd start with and as I found areas I wanted to dig into more, I would start searching for those specifically.

does anyone find Cascading for Hadoop Map Reduce useful?

I've been trying Cascading, but I cannot see any advantage over the classic MapReduce approach for writing jobs.
MapReduce jobs give me more freedom, and Cascading seems to be putting a lot of obstacles in the way.
It might do a good job of making simple things simple, but complex things... I find them extremely hard.
Is there something I'm missing? Is there an obvious advantage of Cascading over the classic approach?
In what scenarios should I choose Cascading over the classic approach? Is anyone using it and happy?
Keeping in mind I'm the author of Cascading...
My suggestion is to use Pig or Hive if they make sense for your problem, Pig especially.
But if you are in the business of data, and not just poking around your data for insights, you will find the Cascading approach makes much more sense for most problems than raw MapReduce.
Your first obstacle with raw MapReduce will be thinking in MapReduce. Trivial problems are simple in MapReduce, but it's much easier to develop complex applications if you can work with a model that maps more naturally to your problem domain (filter this, parse that, sort those, join the rest, etc.).
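To make "thinking in pipes" concrete, here is roughly what the canonical word count looks like when expressed as operations (parse, group, count) rather than as a mapper and a reducer. Treat this as a rough sketch modeled on the word-count example in the Cascading 1.x user guide (linked in another answer below), not copy-paste production code:

    import java.util.Properties;
    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.operation.Function;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;

    public class WordCount {
      public static void main(String[] args) {
        // source: each input line becomes a tuple with a single "line" field
        Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1], SinkMode.REPLACE);

        Pipe assembly = new Pipe("wordcount");
        // parse: emit one "word" tuple per whitespace-delimited token
        Function splitter = new RegexGenerator(new Fields("word"), "\\S+");
        assembly = new Each(assembly, new Fields("line"), splitter);
        // group by word, then count each group
        assembly = new GroupBy(assembly, new Fields("word"));
        assembly = new Every(assembly, new Count(new Fields("count")));

        // the planner compiles this assembly into one or more MapReduce jobs
        Flow flow = new FlowConnector(new Properties()).connect(source, sink, assembly);
        flow.complete();
      }
    }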
Next you will realize that a normal unit of work in Hadoop consists of multiple MapReduce jobs. Chaining jobs together is a solvable problem but it should not leak into your application domain level code, it should be hidden and transparent.
Further, you will find refactoring and creating re-usable code much harder if you have to continually move functions between mappers and reducers, or from a mapper into the previous reducer to gain an optimization, which leads to the issue of brittleness.
Cascading believes in failing as fast as possible. The planner attempts to resolve and satisfy the dependencies between all those field names before the Hadoop cluster is even engaged in work. This means 90%+ of all issues will be found before you spend hours waiting for your job to hit them during execution.
You can alleviate this in raw MapReduce code by creating domain objects like Person or Document, but many applications don't need all the fields downstream. Consider computing the average age of all males: you do not want to pay the IO penalty of passing a whole Person around the network when all you need is a binary gender and a numeric age.
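For instance, a rough sketch of that average-age example (Cascading 1.x-style API; the field names and the "M" gender encoding are made up for illustration), where only two narrow fields ever cross the network:

    import cascading.operation.Identity;
    import cascading.operation.aggregator.Average;
    import cascading.operation.regex.RegexFilter;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.tuple.Fields;

    Pipe people = new Pipe("people");
    // prune the stream down to the two fields we actually need
    people = new Each(people, new Fields("gender", "age"), new Identity());
    // keep only the males (assumes gender is encoded "M"/"F")
    people = new Each(people, new Fields("gender"), new RegexFilter("M"));
    // group and average; the whole Person never travels between nodes
    people = new GroupBy(people, new Fields("gender"));
    people = new Every(people, new Fields("age"), new Average(new Fields("avg-age")));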
With fail-fast semantics and lazy binding of sinks and sources, it becomes very easy to build frameworks on Cascading that themselves create Cascading flows (which become many Hadoop MapReduce jobs). A project I'm currently involved with ends up with hundreds of MapReduce jobs per run, many created on the fly mid-run based on feedback from the data being processed. Search for Cascalog to see an example of a Clojure-based framework for simply creating complex processes, or Bixo for a web mining toolkit and framework that's far easier to customize than Nutch.
Finally, Hadoop is never used alone; your data is always pulled from some external source and pushed to another after processing. The dirty secret about Hadoop is that it is a very effective ETL framework (so it's silly to hear ETL vendors talk about using their tools to push/pull data onto/from Hadoop). Cascading eases this pain somewhat by allowing you to write your operations, applications, and unit tests independent of the integration end-points. Cascading is used in production to load systems like Membase, Memcached, Aster Data, Elastic Search, HBase, Hypertable, Cassandra, etc. (Unfortunately not all the adapters have been released by their authors.)
If you will, please send me a list of the issues you are experiencing with the interface. I am constantly looking for better ways to improve the API and documentation, and the user community is always around to help.
I've been using Cascading for a couple of years now and find it to be extremely helpful. Ultimately, it's about productivity gains. I can be much more efficient in creating and maintaining M/R jobs compared to plain Java code. Here are a few reasons why:
A lot of the boilerplate code used to start a job is already written for you.
Composability. Generally code is easier to read and easier to reuse when it is written as components (operations) which are stitched together to perform some more complex processing.
I find unit testing to be easier. There are examples in the cascading package demonstrating how to write simple unit tests to directly test the output of flows.
The Tap (source and sink) paradigm makes it easy to change the input and output of a job, so you can, for example, start with output to STDOUT for development and debugging, then switch to HDFS sequence files for batch jobs, and then switch to an HBase tap for pseudo-real-time updates (see the sketch just below).
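A sketch of that tap swap (Cascading 1.x tap classes; the paths and the debug flag are made up), where the pipe assembly itself stays untouched:

    import cascading.scheme.SequenceFile;
    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.Lfs;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;

    // development: plain text on the local file system
    Tap devSink = new Lfs(new TextLine(), "dev-output.txt", SinkMode.REPLACE);
    // batch: sequence files on HDFS; same assembly, different endpoint
    Tap batchSink = new Hfs(new SequenceFile(new Fields("word", "count")), "counts", SinkMode.REPLACE);
    Tap sink = debug ? devSink : batchSink;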
Another great advantage of writing Cascading jobs is that you're really writing a factory that creates jobs. This can be a huge advantage when you need to build something dynamically (i.e. when the results of one job control what subsequent jobs you create and run). In another case, I needed to create a job for each combination of 6 binary variables: 64 jobs, all very similar. This would be a hassle with plain Hadoop MapReduce classes.
While there are a lot of pre-built components that you can compose together, if a particular section of your processing logic seems like it would be easier to just write in straight Java, you can always create a Cascading function to wrap that. This allows you to have the benefits of Cascading, but very custom operations can be written as straight java functions (implementing a Cascading interface).
I used Cascading with Bixo to write the complete anti-spam link classification pipeline for a large social network.
The Cascading pipeline resulted in 27 MR jobs, which would have been very difficult to maintain in plain MR. I have written MR jobs before, but using something like Cascading feels like switching from Assembler to Java (insert_fav_language_here).
One of the big advantages over Hive or Pig IMHO is that Cascading is a single jar, which you bundle with your job. Pig and Hive have more dependencies (e.g. MySQL) or are not as easy to embed.
Disclaimer: While I know Chris Wensel personally, I really think Cascading is kick a**. Considering its complexity it is extremely impressive that I haven't found a single bug using it.
I teach the Hadoop Boot Camp course for Scale Unlimited, and also make extensive use of Cascading in Bixo and for building web mining apps at Bixo Labs - so I think I've got a good appreciation for both approaches.
The biggest single advantage I see in Cascading is that it allows you to think about your data processing workflow in terms of operations on fields, and to (mostly) avoid worrying about how to transpose this view of the world onto the key/value model that's intrinsically part of any map-reduce implementation.
The biggest challenge with Cascading is that it is a different way of thinking about data processing workflows, and there's a corresponding conceptual "hump" you need to get over before it all starts making sense. Plus the error messages can remind one of the output from lex/yacc ("conflict in shift/reduce") :)
-- Ken
I think that the place that Cascading's advantages begin to show are instances where you have a pile of simple functions that should all be kept separate in source code, but which can all be collected into a composition in your mapper or reducer. Putting them together makes your basic map-reduce code hard to read, separating them makes the program really slow. Cascading's optimizer can put them together even though you write them separately. Pig and to some extent Hive can do this as well, but for large programs, I think Cascading has a maintainability advantage.
In a few months Plume may be an expressivity competitor, but if you have real programs to write and run in a production setting, then Cascading is probably your best bet.
Cascading allows you to use simple field names and tuples in place of the primitive types offered by Hadoop, which "... tend to be at the wrong level of granularity for creating sophisticated, highly composable code that can be shared among different developers" (Tom White, Hadoop: The Definitive Guide). Cascading was designed to solve those problems. Keep in mind that some of these applications, like Cascading, Hive and Pig, were developed in parallel and sometimes do the same thing. If you don't like Cascading or find it confusing, maybe you would be better off using something else?
I'm sure you already have this, but here is the user guide: http://www.cascading.org/1.1/userguide/pdf/userguide.pdf. It provides a decent walk through of the flow of data in a typical Cascading application.
I worked with Cascading for a couple of years, and these are the useful things about it:
1. Code testability
2. Easy integration with other tools
3. Easy extensibility
4. You focus only on the business logic, not on keys and values
5. Proven in production, used even by Twitter
I recommend people use Cascading most of the time.
Cascading is a wrapper around Hadoop that provides Taps and Sinks to and from Hadoop.
Writing mappers and reducers for all your tasks is going to be tedious. Try writing one Cascading job, and then you're all set to avoid writing mappers and reducers altogether.
You also want to look at Cascading Taps and Schemes (this is how you input data into your Cascading processing job).
With these two, i.e. the ability to avoid writing ad-hoc Hadoop mappers and reducers and the ability to consume a wide variety of data sources, you can solve a lot of your data processing problems quickly and effectively.
Cascading is more than just a simple wrapper around Hadoop; I am trying to keep the answer simple. For example, I've ported a huge MySQL database containing terabytes of data to log files using Cascading's JDBC tap.

Getting started with massive data

I'm a mathematician and occasionally do some statistics/machine learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred of megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I need to know and what are some good resources to learn from?
Hadoop/MapReduce is one obvious start.
Is there a particular programming language I should pick up? (I primarily work now in Python, Ruby, R, and occasionally Java, but it seems like C and Clojure are often used for large-scale data analysis?)
I'm not really familiar with the whole NoSQL movement, except that it's associated with big data. What's a good place to learn about it, and is there a particular implementation (Cassandra, CouchDB, etc.) I should get familiar with?
Where can I learn about applying machine learning algorithms to huge amounts of data? My math background is mostly on the theory side, definitely not on the numerical or approximation side, and I'm guessing most of the standard ML algorithms don't really scale.
Any other suggestions on things to learn would be great!
Apache Hadoop is indeed a good start, because it's free, has a large community and is easy to set up.
Hadoop is built in Java, so Java can be the language of choice. But it is possible to use other languages with Hadoop as well ("Pipes" and "Streaming"). I know that Python is often used, for example.
You can avoid keeping your data in databases if you like. Originally, Hadoop works with data on the (distributed) file system. But as you already seem to know, there are distributed databases available for Hadoop.
Have you ever had a look at Mahout? I think that would be a hit for you ;-) Much of the work you need may already have been done!?
Read the Quick Start and set up your own (pseudo-distributed?) cluster and run the word-count example.
Let me know if you have any questions :-) A comment will remind me of this question.
I've done some large scale machine learning (3-5GB datasets), so here are some insights:
First, there are logistics issues at large scales. Can you load all your data into memory? With Java and a 64-bit JVM you can access as much RAM as you have: for example, the command-line parameter -Xmx8192M will give you access to 8GB (if you have that much). Matlab, being a Java application, can also benefit from this and work with fairly large datasets.
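As a quick sanity check (a minimal, Hadoop-free sketch), you can ask the JVM how much heap it was actually granted:

    public class HeapCheck {
        public static void main(String[] args) {
            // run with: java -Xmx8192M HeapCheck
            // maxMemory() reports the heap ceiling (usually slightly below -Xmx)
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.printf("Max heap: %.2f GB%n", maxBytes / (1024.0 * 1024 * 1024));
        }
    }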
More important are the algorithms that you run on your data. Chances are that standard implementations will expect all of the data in memory. You might have to implement a working-set approach yourself, where you swap data in and out of disk and only work on a portion of the data at a time. These are sometimes referred to as chunking, batch or incremental algorithms, depending on the context.
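A minimal sketch of that batch/incremental idea in Java (the input format, one numeric value per line, is an assumption): fold fixed-size chunks into a running statistic instead of loading the whole file:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class BatchMean {
        static final int BATCH_SIZE = 100000;

        public static void main(String[] args) throws IOException {
            double mean = 0;
            long seen = 0;
            List<Double> batch = new ArrayList<Double>(BATCH_SIZE);
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = in.readLine()) != null) {
                batch.add(Double.parseDouble(line));
                if (batch.size() == BATCH_SIZE) {
                    // fold this chunk into the running mean, then discard it
                    for (double x : batch) mean += (x - mean) / ++seen;
                    batch.clear();
                }
            }
            for (double x : batch) mean += (x - mean) / ++seen; // leftover chunk
            in.close();
            System.out.println("mean = " + mean);
        }
    }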
You are right to suspect that a lot of algorithms do not practically scale, so you might have to go for an approximate solution. The good news is that for almost any algorithm you can find research papers that deal with approximation and/or discuss large scale solutions. The bad news is that you'll most likely have to implement those approaches yourself.
Hadoop is great, but can be a pain in the ass to set up. This is by far the best article I've read on Hadoop setup. I strongly recommend it:
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
Clojure is built on top of Java, so it's unlikely to be any faster than Java. However, it is one of the few languages that does shared memory well, which may or may not be helpful. I'm not a math guy, but it seems most math calculations are very parallelizable, with little need for threads to share memory. Either way, you might want to check out Incanter, Clojure's statistical computing library, and clojure-hadoop, which makes writing Hadoop jobs a lot less painful.
In terms of languages, I find that the differences in performance end up being constant factors. It's far better to just find a language you enjoy and focus on improving your algorithms. However, according to a shootout cited by Peter Norvig (scroll down to the colorful table), you may want to shy away from Python and Perl due to their crappiness with arrays.
In a nutshell, NoSQL is great for unstructured/arbitrarily structured data, while SQL/RDBMS is great (or at least tolerable) for structured data. Changing/adding fields is expensive in an RDBMS, so if that's going to happen a lot, you might want to shy away from them.
However, in your case, it seems like you're going to be batch processing a ton of data and then getting back an answer, as opposed to keeping data around that you will periodically ask questions about. You could probably just process CSVs/text files in Hadoop. Unless you need a performant way of accessing arbitrary information about your data on the fly, I'm not sure either SQL or NoSQL would be useful.

Where do I start with distributed computing?

I'm interested in learning techniques for distributed computing. As a Java developer, I'm probably willing to start with Hadoop. Could you please recommend some books/tutorials/articles to begin with?
Maybe you can read some papers related to MapReduce and distributed computing first, to gain a better understanding of the field. Here are some I would like to recommend:
MapReduce: Simplified Data Processing on Large Clusters, http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean_html/
Bigtable: A Distributed Storage System for Structured Data, http://www.usenix.org/events/osdi06/tech/chang/chang_html/
Dryad: Distributed data-parallel programs from sequential building blocks, http://pdos.csail.mit.edu/6.824-2007/papers/isard-dryad.pdf
The Landscape of Parallel Computing Research: A View from Berkeley, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.8705&rep=rep1&type=pdf
On the other hand, if you want to get to know Hadoop better, maybe you can start reading the Hadoop MapReduce framework source code.
Currently, book-wise, I would check out Hadoop: The Definitive Guide. It's written by Tom White, who has worked on Hadoop for a good while now and works at Cloudera with Doug Cutting (Hadoop's creator).
Also on the free side, Jimmy Lin from UMD has written a book called Data-Intensive Text Processing with MapReduce. Here's a link to the final pre-production version (link provided by the author on his website).
Hadoop is not necessarily the best tool for all distributed computing problems. Despite its power, it also has a pretty steep learning curve and cost of ownership.
You might want to clarify your requirements and look for suitable alternatives in the Java world, such as HTCondor, JPPF or GridGain (my apologies to those I do not mention).
Here are some resources from Yahoo! Developer Network
a tutorial:
http://developer.yahoo.com/hadoop/tutorial/
an introductory course (requires Silverlight, sigh):
http://yahoo.hosted.panopto.com/CourseCast/Viewer/Default.aspx?id=281cbf37-eed1-4715-b158-0474520014e6
The All Things Hadoop Podcast http://allthingshadoop.com/podcast has some good content and good guests. A lot of it is geared to getting started with Distributed Computing.
MIT 6.824 is the best material. Reading only the Google papers related to Hadoop is not enough. Systematic coursework is needed if you want to go deeper.
If you are looking to learn a distributed computing platform that is less complicated than Hadoop you can try Zillabyte. You only need to know some Ruby or Python to build apps on the platform.
As LoLo said, Hadoop is a powerful solution, but can be rough to start with.
For materials to learn about distributed computing try http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-824-distributed-computer-systems-engineering-spring-2006/syllabus/. There are several resources recommended by the course as well.
