I have a program that I wish to compute by splitting it up across many computers. Is this something I can accomplish with Hadoop or Map/Reduce, and if so, how do I begin using it? Does it cost money to use that many computers?
Whether you can split your program depends on the nature of your algorithm. You should split the input data and, on each node, apply your program to a subset of that input; that is, you should implement data parallelism. Each node executes the same program but on a smaller input.
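As a very rough single-machine illustration of the same idea (using Python's multiprocessing module instead of Hadoop; the function and data are made up for the example), you split the input into chunks, run the same function on each chunk, and combine the partial results:

```python
# Minimal data-parallelism sketch: the same function is applied to each chunk
# of the input independently, and the partial results are combined at the end.
# Hadoop/MapReduce applies the same idea, but the workers are cluster nodes and
# the framework handles splitting and shipping the data for you.
from multiprocessing import Pool

def count_matching_lines(chunk):
    # Per-chunk work: any function that only needs its own slice of the input.
    return sum(1 for line in chunk if "error" in line)

if __name__ == "__main__":
    lines = ["error in module a", "all good", "error again", "ok"] * 1000
    chunks = [lines[i::4] for i in range(4)]   # split the input 4 ways
    with Pool(processes=4) as pool:
        partial = pool.map(count_matching_lines, chunks)
    print(sum(partial))                        # combine the partial results
```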
My advice: take a look at the book "Hadoop: The Definitive Guide"; the first two chapters can help you understand things better.
If you want to try simple MapReduce programs such as WordCount, you can download the Hortonworks Sandbox, install it on a virtual machine, and have a single-node Hadoop installation very quickly. Here is the link: http://hortonworks.com/products/hortonworks-sandbox/
Related
I have chosen the Hadoop MapReduce framework as my seminar topic, and being a beginner I have very little knowledge about it.
My doubts are as follows:
Do the JobTracker and TaskTracker come as software or hardware?
In one of the examples, MapReduce was used to find the occurrences of all words in some (say 3) files. Where do you give the query, i.e. "find occurrences of word", in the MapReduce program? How does map know what is to be done?
a) Do the JobTracker and TaskTracker come as software or hardware?
Both are software, i.e. services running on the nodes of a cluster.
b) In one of the examples, MapReduce was used to find the occurrences of all words in some (say 3) files. Where do you give the query, i.e. "find occurrences of word", in the MapReduce program? How does map know what is to be done?
There is no query language, unless you are using Hive. You basically implement your logic inside the Mapper and Reducer classes.
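For a feel of what that logic looks like, here is a minimal word-count sketch using Hadoop Streaming with Python scripts rather than Java Mapper/Reducer classes. It assumes the usual streaming contract: the mapper reads lines from stdin and writes tab-separated key/value pairs to stdout, and the reducer receives the mapper output sorted by key.

```python
#!/usr/bin/env python
# mapper.py - word-count mapper for Hadoop Streaming (sketch).
# Emits "word<TAB>1" for every word in the input lines fed on stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py - word-count reducer for Hadoop Streaming (sketch).
# Streaming sorts the mapper output by key, so all counts for one word
# arrive together and can be summed as they stream past.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

The "query" (count the occurrences of every word) is therefore not given anywhere as a query; it is simply what the mapper and reducer code does.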
If you want to prepare for the seminar, go through the first two chapters of Hadoop: The Definitive Guide.
Get to know what kinds of problems Hadoop solves and its architecture.
Then, if you don't understand something, I'm happy to help :)
Hi, I came across this article and it made me wonder how easy it would be for a hacker to crack passwords. What do you think, guys?
If you want to try out many permutations in a brute-force manner, I don't think that using Hadoop would give you any benefit. Hadoop is not something that fits all use cases, and it will not perform well every time.
Computing permutations can be done in batches: just set different start and end parameters for each machine. The overhead involved in setting up a job, moving data across nodes, and cleaning up the job can surely be saved. I have seen that running separate processes over 5 nodes, by pre-dividing the load equally, performed pretty well compared to map-reduce. Of course, I don't mean that map-reduce is bad; it's just that the scenario wasn't the right fit for getting the job done.
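To sketch the pre-dividing idea (the alphabet, candidate length, and check_candidate function below are placeholders made up for illustration), each machine walks its own slice of the permutation sequence, identified only by a start and end index, so there is no job setup or data movement between nodes:

```python
# Sketch of pre-dividing a brute-force permutation search across machines.
from itertools import islice, permutations
from math import factorial

ALPHABET = "abcdefgh"   # assumed search space, for illustration only
LENGTH = 5

def check_candidate(candidate):
    # Placeholder test, e.g. hash the candidate and compare to a target hash.
    return candidate == "badge"

def search_slice(start, end):
    # Lazily walk permutations[start:end) without materialising the whole list.
    for p in islice(permutations(ALPHABET, LENGTH), start, end):
        candidate = "".join(p)
        if check_candidate(candidate):
            return candidate
    return None

if __name__ == "__main__":
    total = factorial(len(ALPHABET)) // factorial(len(ALPHABET) - LENGTH)
    machines = 4
    per_machine = total // machines + 1
    # In practice each machine would run only its own (start, end) slice.
    for i in range(machines):
        hit = search_slice(i * per_machine, min((i + 1) * per_machine, total))
        if hit:
            print("found:", hit)
```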
I found this Recursive Algorithm on Distributed Systems article an interesting way to run recursive algorithms on a distributed system. Permutation and combination algorithms can then be used to do some interesting stuff.
I was dusting off my copy of the Numerical Recipes in C book and started wondering if a similar treatment exists for algorithms that can be parallelized using MapReduce and Hadoop.
Thanks.
There is a nice compilation from atbrox of the different algorithms in different domains. For text processing, here is another nice one. Also, search on CiteSeerX.
I am working on an NLP project that can create entity sets and compute pairwise similarity from large-scale corpora. Currently I am using Hadoop Streaming and have implemented all the mappers and reducers in Python. Since the algorithm needs several rounds of map-reduce, I use shell scripts to chain the jobs.
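Roughly, the chaining currently looks like the following sketch (the streaming jar location, HDFS paths, and script names here are placeholders rather than the real ones):

```python
# chain.py - rough sketch of chaining Hadoop Streaming jobs from a driver script.
# Each round is one "hadoop jar .../hadoop-streaming.jar ..." invocation, and the
# output directory of one round becomes the input of the next.
import subprocess

STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"  # placeholder path

def run_streaming_job(mapper, reducer, input_dir, output_dir):
    cmd = [
        "hadoop", "jar", STREAMING_JAR,
        "-files", mapper + "," + reducer,   # ship the scripts to the task nodes
        "-mapper", mapper, "-reducer", reducer,
        "-input", input_dir, "-output", output_dir,
    ]
    subprocess.run(cmd, check=True)         # raises if the job fails

if __name__ == "__main__":
    run_streaming_job("map1.py", "red1.py", "/data/corpus", "/data/round1")
    run_streaming_job("map2.py", "red2.py", "/data/round1", "/data/round2")
```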
Now here are my concerns and what I want to do next:
[Concern 1]. Job chaining and job control. Chaining Hadoop Streaming jobs is problematic. If job2 in the job sequence (job1-job2-job3) fails, I have to manually delete the output folder, adjust the script file that launches the jobs, and re-run the job sequence from the middle. I really hope to find a smarter way to do this. As I need to frequently adjust the parameters and the logic of the algorithms, I don't want to repeat those steps again and again.
[Concern 2]. Speed and efficiency. I think a great proportion of the time cost goes to parsing text into numbers and similar format-conversion processes, which are totally unnecessary. It also takes a lot of time to type and test all that near-duplicate Python code during development.
[Concern 3]. Ease of maintenance and distribution. As the project is getting big (I have more than 20 map-reduce jobs), it really needs to be modularized. I want to make it object-oriented and use an IDE to develop and maintain the project, so that it is more comfortable to handle the various internal data structures and formats. I also want to distribute the project as a package, so that other people may benefit from it. I need an easy way to import the whole project into an IDE and make it easy to distribute.
I've done some research on possible solutions:
Alternative 1. Hadoop custom jars: It seems that the best way to do this is to convert the entire project into Java as a custom Hadoop jar. This may fix all the problems, including the job chaining, efficiency, and maintenance issues, but it may take quite a lot of time, and I have not yet found a way to debug it efficiently.
Alternative 2. Pig: I found the answer to this question quite helpful in figuring out when (not) to use Pig. In the answer, Arun_suresh says that if "you have some very specific computation you need to do within your Map/reduce functions … then you should consider deploying your own jars". My job includes shingling, hashing, min-hashing, permutation, etc. Can that be implemented in Pig Latin? Is there a place where I can get an idea of how complex the computations in Pig Latin programs can be?
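To give an idea of the kind of per-record computation I mean, here is a rough sketch of shingling plus a min-hash signature as it might run inside one mapper (the shingle size, the number of hash functions, and the salted-MD5 hashing are arbitrary choices for illustration):

```python
# Sketch of shingling + min-hashing for one document, the sort of logic that
# would normally sit inside a mapper.
import hashlib

def shingles(text, k=3):
    # k-word shingles of a document.
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=16):
    # For each "hash function" (here just a salted MD5), keep the minimum
    # hash value over all shingles; the vector of minima is the signature.
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(("%d:%s" % (seed, s)).encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return signature

if __name__ == "__main__":
    doc = "the quick brown fox jumps over the lazy dog"
    print(minhash_signature(shingles(doc))[:4])
```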
Alternative 3. Mahout: I found that the newly released Apache Mahout version has several functions that overlap with what I am doing, but it cannot replace my work. Should I base my project on Mahout?
Since I am basically on my own for this job, and only have about 2 weeks budgeted for the whole housekeeping work and about 1 month to improve it, I really need to find an efficient and reliable way to do it. Please help me choose one of the alternatives, or tell me if you have a better solution.
Concern 1: Look into using Oozie. You can define the entire job chain in a 'workflow' with a number of actions (they can be map-reduce steps, decision switches, joins, and so on).
Each action can define a prepare stage in which you can delete the upcoming output folder.
You are also able to see where in the workflow a failure occurred, and re-run the workflow from that particular step.
The Oozie documentation has examples of running streaming jobs, and some cookbook examples too (Case 7).
Concern 2: This is a problem shared by all streaming jobs. If you want the best performance, you'll either need to bite the Java bullet or seek alternative stream input/output formats that can handle binary formats (such as Avro), and you'll need custom mapper/reducer stream wrappers that know how to handle them.
Concern 3: With streaming, you can bundle all the Python scripts into a tarball and have Hadoop unpack the tarball on every task node. This is a function of the DistributedCache archives feature (in Oozie this is defined in the archives element of the map-reduce action; see Case 6 in the previous Oozie cookbook link).
Ok, I am attempting to learn Hadoop and MapReduce. I really want to start with MapReduce, and what I find are many, many simplified examples of mappers and reducers, etc. However, I seem to be missing something.
While an example showing how many occurrences of a word are in a document is simple to understand, it does not really help me solve any "real-world" problems. Does anybody know of a good tutorial on implementing MapReduce in a pseudo-realistic situation? Say, for instance, I want to use Hadoop and MapReduce on top of a data store similar to AdventureWorks. Now I want to get the orders for a given product in the month of May. How would that look from a Hadoop/MapReduce perspective? (I realize this may not be the type of problem MapReduce is intended to solve, but it just came to mind quickly.)
Any direction would help.
The book Hadoop: The Definitive Guide is a good place to start. The introductory chapters should help you figure out where MapReduce is useful and when you should use it. The more advanced chapters have plenty of examples that are more realistic than word count.
If you want to dive deeper, you may want to check out Data-Intensive Text Processing with MapReduce. This definitely has plenty of "real-world" use cases, but it doesn't sound like you are interested in doing text processing.
For your particular example, the main things to realize are:
The map phase is mostly for parsing, transforming, and filtering data. Think of it as a record-by-record, shared-nothing approach to record processing. In word count, this is parsing the line and splitting out the words.
The reduce phase is all about aggregation: counting, averaging, min/max, etc. In word count, this is counting up the instances of each word.
So, if you wanted all the records for a given product in the month of May, you could use a map-only job to filter through all the data and keep only the records you want. However, you really should read about what Hadoop is useful for. A question that would fit Hadoop better would be: give me a count of how many times every item was purchased in every month (to build a matrix, perhaps). Very rarely are you looking for specific records as you suggest.
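As a rough sketch of that counting job with Hadoop Streaming in Python (the CSV column layout below is an assumption for illustration, not the actual AdventureWorks schema), the mapper emits a (product, month) key with the quantity and the reducer sums per key; the map-only filter would instead just print the matching input lines and run with zero reduce tasks:

```python
# count_mapper.py - emit one "(product_id,month)<TAB>quantity" pair per order,
# assuming input lines of the form "order_id,product_id,order_date,quantity"
# with ISO dates (yyyy-mm-dd). Both the format and the columns are hypothetical.
import sys

for line in sys.stdin:
    order_id, product_id, order_date, quantity = line.rstrip("\n").split(",")
    month = order_date[:7]                           # e.g. "2012-05"
    print("%s,%s\t%s" % (product_id, month, quantity))
```

```python
# count_reducer.py - sum the quantities for each (product, month) key; the keys
# arrive sorted, so one running total per key is enough.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, total))
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, total))
```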
If you are looking for a more real-time access platform, you should check out HBase once you are done learning about Hadoop.
Hadoop can be used for a wide variety of problems. Check this blog entry from atbrox. Also, there is a lot of information on the internet about Hadoop and MapReduce, and it's easy to get lost. So, here is a consolidated list of resources on Hadoop.
BTW, Hadoop - The Definitive Guide 3rd edition is due in May. It looks like it also covers MRv2 (NextGen MapReduce) and includes more case studies. The 2nd edition is worthwhile too, as orangeoctopus mentioned.
MapReduce can be a complex topic, so I found it easier to understand by applying its approach to a simple problem. I then go on to describe how MapReduce makes it straightforward to solve the same problem on a cluster. You can take a look at my article here: Intro to Parallel Processing with MapReduce.
Let me know if you think this article makes it easier to understand MapReduce and Hadoop.