Hadoop examples besides word count

I would like to learn how Hadoop is applied in real-world scenarios. Currently, most examples cover only the word count problem, and none show an industrial use case.
Are there other Hadoop examples or tutorials out there that solve problems other than word count?

See https://github.com/adamjshook/mapreducepatterns for the source code examples documented in the book "MapReduce Design Patterns" by Miner and Shook. I have tried them all, and they all work in Hadoop using Cloudera's Training VMs.

Just to add a book that gives examples and brings simplicity and practicality to real-world problems:
Data Algorithms: Recipes for Scaling Up with Hadoop and Spark, by Mahmoud Parsian

Related

Numerical Recipes for Hadoop

I was dusting off my copy of Numerical Recipes in C and started wondering whether a similar treatment exists for algorithms that can be parallelized using MapReduce and Hadoop.
Thanks.

There is a nice compilation from atbrox of algorithms in different domains. For text processing, here is another nice one. Also, search CiteSeerX.

MapReduce project with data mining

I am planning to do a MapReduce project that uses Hadoop libraries and to test it on big data uploaded to AWS. I have not settled on an idea yet, but I am sure it will involve some kind of data processing, MapReduce design patterns, and possibly graph algorithms, Hive and Pig Latin. I would really appreciate it if someone could give me some ideas. I have a few of my own in mind.
In the end, I have to work on a large data set, extract some information and derive conclusions from it. I have used Weka for data mining before (using trees).
But I am not sure whether Weka is the only thing I can work with right now. Are there other ways to work on large data sets and derive conclusions from them?
Also, how can I involve graphs in this?
Basically, I want to do a research project, but I am not sure what exactly I should work on or what it should look like. Any thoughts, suggested links or ideas?

I suggest you check out Apache Mahout. It is a scalable machine learning and data mining framework that should integrate nicely with Hadoop.
Hive gives you a SQL-like language for querying big data. Essentially, it translates your high-level query into MapReduce jobs and runs them on the cluster.
Another suggestion is to write your data processing algorithm in R, which is statistical software (similar to MATLAB). Instead of the standard R environment, I would recommend Revolution R, an environment for developing in R with much more powerful tools for big data and clustering.
Edit: If you are a student, Revolution R has a free academic edition.
Edit: A third suggestion is to look at GridGain, another Map/Reduce implementation in Java that is relatively easy to run on a cluster.

As you are already working with MapReduce and Hadoop, you can extract some knowledge from your data using Mahout, or you can get some ideas from this very good book:
http://infolab.stanford.edu/~ullman/mmds.html
This book provides ideas for mining social-network graphs, and works with graphs in a couple of other ways too.
Hope it helps!

Practical usage of Hadoop, MapReduce, Hive, Pig, HBase

Hello,
I am learning Hadoop. After reading the material found on the net (tutorials, MapReduce concepts, Hive, Pig and so on) and developing some small applications with those tools, I would like to learn about real-world uses of these technologies.
What everyday software do we use that is based on the Hadoop stack?

If you use the internet, there is a good chance that you are indirectly affected by Hadoop/MapReduce, from Google Search to Facebook to LinkedIn, etc. Here are some interesting links that show how widespread Hadoop/MR usage is:
Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)
10 ways big data changes everything
One thing to note is that Hadoop/MR is not an efficient solution for every problem. Also consider other distributed programming models, such as those based on BSP.
Happy Hadooping !!!

Here are some sample MapReduce examples which will be helpful for beginners:
1. Word Count
2. SQL Aggregation using MapReduce
3. SQL Aggregation on multiple fields using MapReduce
URL - http://hadoopdeveloperguide.blogspot.in/
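
To give a feel for the third item, here is a minimal sketch of a SQL-style aggregation on multiple fields expressed as a map and a reduce step. It is plain Python that simulates the phases in one process, not Hadoop API code and not taken from the linked blog; the `SALES` rows, the field names and the `group_by_sum` helper are all made up for illustration. It mimics roughly `SELECT region, product, SUM(amount) ... GROUP BY region, product`.

```python
from collections import defaultdict

# Hypothetical input rows: (region, product, amount).
SALES = [
    ("east", "bike", 100.0),
    ("east", "bike", 50.0),
    ("west", "bike", 70.0),
    ("east", "helmet", 20.0),
]


def mapper(row):
    """Emit a composite key made of the GROUP BY fields, with the amount as value."""
    region, product, amount = row
    yield (region, product), amount


def reducer(key, amounts):
    """Apply the aggregate function (SUM here) to all values sharing a key."""
    yield key, sum(amounts)


def group_by_sum(rows):
    """Simulate map, shuffle (group values by key) and reduce in one process."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            groups[key].append(value)
    return dict(out for key in sorted(groups) for out in reducer(key, groups[key]))


totals = group_by_sum(SALES)
```

The only difference from a single-field aggregation is that the mapper emits a tuple of fields as the key, so the shuffle groups on the combination.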

MapReduce for dummies

OK, I am attempting to learn Hadoop and MapReduce. I really want to start with MapReduce, and what I find are many, many simplified examples of mappers and reducers, etc. However, I seem to be missing something.
While an example showing how many occurrences of a word are in a document is simple to understand, it does not really help me solve any "real world" problems. Does anybody know of a good tutorial on implementing MapReduce in a pseudo-realistic situation? Say, for instance, I want to use Hadoop and MapReduce on top of a data store similar to AdventureWorks, and I want to get the orders for a given product in the month of May. How would that look from a Hadoop/MapReduce perspective? (I realize this may not be the type of problem MapReduce is intended to solve, but it just came to mind quickly.)
Any direction would help.

The book Hadoop: The Definitive Guide is a good place to start. The introductory chapters should be really useful to you to figure out where MapReduce is useful and when you should use it. The more advanced chapters have plenty of more realistic examples than word count.
If you want to dive deeper, you may want to check out Data-Intensive Text Processing with MapReduce. This definitely has plenty of "real-world" use cases, but it doesn't sound like you are interested in doing text processing.
For your particular example, the main things to realize are:
The map phase is mostly for parsing, transforming data, and filtering out data. Think record-by-record, shared-nothing approach to record processing. In word count, this is parsing the line and splitting out the words.
The reduce phase is all about aggregation: counting, averaging, min/max, etc. In word count, this is counting up the instances of the word.
So, if you want all the records for a given product in the month of May, you could use a map-only job to filter through all the data and keep only the records you want. However, you really should read about what Hadoop is useful for. A question that would fit Hadoop better would be: give me a count of how many times every item was purchased in every month (to build a matrix, perhaps). Very rarely are you looking for specific records like you suggest.
If you are looking for a more real-time access platform, you should check out HBase once you are done learning about Hadoop.
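
As a rough illustration of the two jobs described above, here is a sketch in plain Python that simulates the map, shuffle and reduce phases in a single process. It is not Hadoop API code; the `ORDERS` records, the field layout and the `run_job` helper are hypothetical and exist only to show the shape of each job.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order records: (order_id, product, order_date, quantity).
ORDERS = [
    (1, "bike", date(2011, 5, 3), 2),
    (2, "helmet", date(2011, 5, 9), 1),
    (3, "bike", date(2011, 6, 1), 1),
    (4, "bike", date(2011, 5, 20), 4),
    (5, "helmet", date(2011, 6, 15), 3),
]


def filter_mapper(record, product, month):
    """Map-only job: emit the record only if it matches; no reducer is needed."""
    _, rec_product, order_date, _ = record
    if rec_product == product and order_date.month == month:
        yield record


def count_mapper(record):
    """Emit ((product, month), 1) per order, like (word, 1) in word count."""
    _, product, order_date, _ = record
    yield (product, order_date.month), 1


def count_reducer(key, values):
    """Aggregate all values that share a key."""
    yield key, sum(values)


def run_job(records, mapper, reducer=None):
    """Simulate map, shuffle (group by key) and reduce in a single process."""
    mapped = [out for record in records for out in mapper(record)]
    if reducer is None:
        return mapped  # map-only job: mapper output is the final output
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]


# Job 1: all "bike" orders placed in May (filtering, map-only).
may_bikes = run_job(ORDERS, lambda r: filter_mapper(r, "bike", 5))

# Job 2: number of orders per (product, month) (aggregation, map + reduce).
purchase_counts = run_job(ORDERS, count_mapper, count_reducer)
```

The first job never groups anything, which is why it needs no reducer. The second has exactly the structure of word count, with the (product, month) pair playing the role of the word.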

Hadoop can be used for a wide variety of problems. Check this blog entry from atbrox. Also, there is a lot of information on the internet about Hadoop and MapReduce and it's easy to get lost. So, here is the consolidated list of resources on Hadoop.
BTW, the 3rd edition of Hadoop: The Definitive Guide is due in May. It looks like it also covers MRv2 (NextGen MapReduce) and includes more case studies. The 2nd edition is worth reading, as mentioned by orangeoctopus.

MapReduce can be a complex topic so I found it easier to understand it by applying its approach to a simple problem. Then I go on to describe how MapReduce makes it straightforward to solve the same problem in a cluster. You can take a look in my article here: Intro to Parallel Processing with MapReduce.
Let me know if you think this article makes it easier to understand MapReduce and Hadoop.

Hadoop Hypercube

Hey,
I am starting a Hadoop-based hypercube with a flexible number of dimensions.
Does anybody know of any existing approaches for this?
I have only found PigOLAPSketch, but there is no code available for it.
Another approach is Zohmg from Last.fm, which uses HBase but seems to be very dead.
I think I will start with a Pig solution. Do you have any advice?

This would be very cool/useful. OpenTSDB is an HBase time-series database that might be interesting to look at; it has a clever approach to secondary indexing.

You can also look at a GPU-based database, https://www.kinetica.com/,
but it is not open source, and it requires separate appliances and moving data from Hadoop to the Kinetica infrastructure.
