I was dusting off my copy of Numerical Recipes in C and started wondering whether a similar treatment exists for algorithms that can be parallelized using MapReduce and Hadoop.
Thanks.
There is a nice compilation from atbrox of algorithms in different domains. For text processing, here is another nice one. Also, search CiteSeerX.
Related
I would like to learn about Hadoop applications in real-world scenarios. Currently most of the examples only cover the word count problem, with no examples of industrial use cases.
Are there other Hadoop examples, or Hadoop tutorials out there, that solve problems besides word count?
See https://github.com/adamjshook/mapreducepatterns for source code examples that are documented in the book "MapReduce Design Patterns" by Miner and Shook. I have tried them all, and they all work in Hadoop using Cloudera's training VMs.
Just to add a book that gives practical, real-world examples in a simple way:
Data Algorithms: Recipes for Scaling Up with Hadoop and Spark by Mahmoud Parsian
I'm just trying to understand how Hadoop distinguishes between multiple files in HDFS. I want to do sentiment analysis using Hadoop (just a test). I have two files, positive.json and negative.json, and I'm trying to use Naive Bayes classification. So, when I train the model, I want to know which ones are positive and which ones are negative. How do I do this? I haven't written any code to show; I'm stuck on the first part. Any suggestions? I have read tons of papers, and I think I have the basic concept. I want to see if I can use this concept in Rhipe, or is there a better, easier solution?
I have a program whose work I would like to split across many computers. Is this something I can accomplish with Hadoop or MapReduce, and if so, how do I get started? Does it cost money to use that many computers?
Whether you can split your program up depends on the nature of your algorithm. Typically you split the input data and, on each node, apply your program to a subset of that input; that is, you implement data parallelism. Each node executes the same program, but on a smaller input.
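Roughly speaking, that per-node piece becomes a Hadoop mapper. Here is a minimal sketch using Hadoop's Java API; the process() helper is a hypothetical stand-in for whatever your program does to a single record, not part of Hadoop:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Every node runs this same mapper, but only over its own split of the input.
    // Hadoop distributes the splits; you only write the per-record logic.
    public class MyProgramMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String result = process(record.toString());
            context.write(record, new Text(result));
        }

        // Hypothetical placeholder for your per-record computation.
        private String process(String record) {
            return record.toUpperCase();
        }
    }

The same class runs unchanged whether you have one node or a hundred; only the amount of input each node sees changes.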
My advice: take a look at the book "Hadoop: The Definitive Guide"; the first two chapters will help you understand things better.
If you want to try simple MapReduce programs such as WordCount, you can download the Hortonworks Sandbox, which you can install on a virtual machine to get a single-node Hadoop installation very quickly. Here is the link: http://hortonworks.com/products/hortonworks-sandbox/
I am planning to do a MapReduce project involving Hadoop libraries and testing it on big data uploaded to AWS. I have not finalized an idea yet, but I am sure it will involve some kind of data processing, MapReduce design patterns, and possibly graph algorithms, Hive, and Pig Latin. I would really appreciate it if someone could give me some ideas; I have a few of my own in mind.
In the end I have to work on some large data set, extract some information, and derive some conclusions. For this I have used Weka before for data mining (using trees).
But I am not sure whether Weka is the only thing I can work with right now. Are there other ways to work on large data and derive conclusions from a large data set?
Also, how can I involve graphs in this?
Basically I want to do a research project, but I am not sure exactly what I should be working on or what it should look like. Any thoughts? Suggested links or ideas?
I suggest you check out Apache Mahout; it is a scalable machine learning and data mining framework that should integrate nicely with Hadoop.
Hive gives you an SQL-like language to query big data; essentially, it translates your high-level query into MapReduce jobs and runs them on the cluster.
Another suggestion is to consider implementing your data processing algorithm in R, a statistical software package (similar to MATLAB). Instead of the standard R environment, I would recommend Revolution R, an R development environment with much more powerful tools for big data and clustering.
Edit: If you are a student, Revolution R has a free academic edition.
Edit: A third suggestion is to look at GridGain, another MapReduce implementation in Java that is relatively easy to run on a cluster.
As you are already working with MapReduce and Hadoop, you can extract some knowledge from your data using Mahout, or you can get some ideas from this very good book:
http://infolab.stanford.edu/~ullman/mmds.html
This book provides ideas for mining social-network graphs, and it works with graphs in a couple of other ways too.
Hope it helps!
OK, I am attempting to learn Hadoop and MapReduce. I really want to start with MapReduce, and what I find are many, many simplified examples of mappers and reducers, etc. However, I seem to be missing something.
While an example showing how many occurrences of a word are in a document is simple to understand, it does not really help me solve any "real world" problems. Does anybody know of a good tutorial on implementing MapReduce in a pseudo-realistic situation? Say, for instance, I want to use Hadoop and MapReduce on top of a data store similar to AdventureWorks. Now I want to get the orders for a given product in the month of May. How would that look from a Hadoop/MapReduce perspective? (I realize this may not be the type of problem MapReduce is intended to solve, but it just came to mind quickly.)
Any direction would help.
The book Hadoop: The Definitive Guide is a good place to start. The introductory chapters should be really useful for figuring out where MapReduce is useful and when you should use it. The more advanced chapters have plenty of examples that are more realistic than word count.
If you want to dive deeper, you may want to check out Data-Intensive Text Processing with MapReduce. This definitely has plenty of "real-world" use cases, but it doesn't sound like you are interested in doing text processing.
For your particular example, the main things to realize are:
The map phase is mostly for parsing, transforming, and filtering data. Think of it as a record-by-record, shared-nothing approach to record processing. In word count, this is parsing the line and splitting out the words.
The reduce phase is all about aggregation: counting, averaging, min/max, etc. In word count, this is counting up the instances of each word.
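For reference, here is how those two phases look in Hadoop's Java API for the classic word count; it is only a sketch of the mapper and reducer, with the driver setup left out:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: parse each line and emit (word, 1), record by record, shared nothing.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: aggregate by summing the counts that arrive for each word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }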
So, if you wanted all the records for a given product in the month of May, you could use a map-only job to filter through all the data and keep only the records you want (a sketch follows below). However, you really should read about what Hadoop is useful for. A question that would fit Hadoop better would be: give me a count of how many times every item was purchased in every month (to build a matrix, perhaps). Very rarely are you looking for specific records like you suggest.
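To make the filtering idea concrete, here is a rough sketch of such a map-only job. The CSV layout (order ID, product ID, quantity, order date) and the filter.product.id property are made up for illustration; they are not taken from AdventureWorks itself:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MayOrdersFilter {

        public static class FilterMapper
                extends Mapper<LongWritable, Text, Text, NullWritable> {
            private String productId;

            @Override
            protected void setup(Context context) {
                // Hypothetical job property holding the product we want to keep.
                productId = context.getConfiguration().get("filter.product.id");
            }

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Assumed CSV layout: orderId,productId,quantity,orderDate (yyyy-MM-dd)
                String[] fields = line.toString().split(",");
                if (fields.length >= 4
                        && fields[1].equals(productId)
                        && fields[3].substring(5, 7).equals("05")) {  // month is May
                    context.write(line, NullWritable.get());          // keep the record as-is
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("filter.product.id", args[2]);   // product to keep
            Job job = Job.getInstance(conf, "may-orders-filter");
            job.setJarByClass(MayOrdersFilter.class);
            job.setMapperClass(FilterMapper.class);
            job.setNumReduceTasks(0);                 // map-only: no reduce/aggregation needed
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With zero reducers, each mapper's surviving records are written straight to the output files, which is exactly the filtering behaviour described above.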
If you are looking for a more real-time access platform, you should check out HBase once you are done learning about Hadoop.
Hadoop can be used for a wide variety of problems. Check out this blog entry from atbrox. Also, there is a lot of information on the internet about Hadoop and MapReduce, and it's easy to get lost, so here is a consolidated list of resources on Hadoop.
BTW, Hadoop: The Definitive Guide, 3rd edition, is due in May. It looks like it also covers MRv2 (NextGen MapReduce) and includes more case studies. The 2nd edition is worth reading, as mentioned by orangeoctopus.
MapReduce can be a complex topic, so I found it easier to understand by applying its approach to a simple problem. I then go on to describe how MapReduce makes it straightforward to solve the same problem on a cluster. You can take a look at my article here: Intro to Parallel Processing with MapReduce.
Let me know if you think this article makes it easier to understand MapReduce and Hadoop.