How to convert existing MapReduce applications to Crunch?

I have several (about a dozen) MapReduce tasks implemented, each of which functions as part of a workflow executed by a simple bash script. For a variety of reasons, I would like to move the workflow to Apache Crunch.
However, it's not clear to me how to run my MapReduce tasks as Crunch functions without re-implementing them. Is there a straightforward way to use Map and Reduce implementations as Crunch functions? I would like to maintain the Tool implementations as well so the MapReduce tasks can be run both standalone and as part of the Crunch workflow; is there any way to do this?
Thanks for any insight.

For anyone who might stumble across this: there is a minimally documented API for this in the Crunch libraries, and it is fairly straightforward to use.
See here: https://crunch.apache.org/apidocs/0.10.0/org/apache/crunch/lib/Mapreduce.html
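For illustration, here is a rough, hypothetical sketch of what that can look like, based on the javadoc linked above. MyMapper and MyReducer stand in for your existing new-API (org.apache.hadoop.mapreduce) classes, the paths are placeholders, and the exact read/write calls and generic signatures should be checked against the Crunch version you are using:

    import org.apache.crunch.PGroupedTable;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.io.To;
    import org.apache.crunch.lib.Mapreduce;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    public class CrunchWrapperDriver {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(CrunchWrapperDriver.class);

        // Read <LongWritable, Text> pairs, e.g. from a SequenceFile produced upstream.
        PTable<LongWritable, Text> input = pipeline.read(
            From.sequenceFile(args[0], Writables.tableOf(
                Writables.writables(LongWritable.class),
                Writables.writables(Text.class))));

        // Run the existing (hypothetical) Mapper class unchanged.
        PTable<Text, LongWritable> mapped =
            Mapreduce.map(input, MyMapper.class, Text.class, LongWritable.class);

        // Group by key, then run the existing (hypothetical) Reducer class unchanged.
        PGroupedTable<Text, LongWritable> grouped = mapped.groupByKey();
        PTable<Text, LongWritable> reduced =
            Mapreduce.reduce(grouped, MyReducer.class, Text.class, LongWritable.class);

        pipeline.write(reduced, To.sequenceFile(args[1]));
        pipeline.done();
      }
    }

Because the Mapper and Reducer classes themselves are left untouched, the existing Tool drivers should be able to stay as they are, so the jobs remain runnable standalone as well as inside the Crunch pipeline.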

Related

Other paradigms beyond MapReduce

MapReduce is the most popular paradigm for processing files in a distributed system. However, the YARN architecture makes it possible for programmers to build applications using other paradigms.
Of course, some paradigms are better suited than others to particular applications.
For example:
I'm using MapReduce to process a video file, but this _________ paradigm would be better.
I'm using MapReduce to process SQL query files, but this ________ paradigm would be better.
I want to know how best to choose the most efficient paradigm for a given problem.
I hope someone can understand the question.

Best programming way to implement Map Reduce

We have a problem which is an ideal case for applying the MapReduce programming technique. The initial code for this is written in Python. Now we have the following options:
Use Hadoop and Java to implement the MapReduce part.
Use mincemeat and Python to implement the MapReduce part.
Use Hadoop and Python (Hadoop MapReduce Program in Python) to implement the MapReduce part.
I'm not very sure which will be the best option. Can anyone please help?
Since your initial code is in Python, and it doesn't make much of a difference whether you write MR in Python or Java, (3) should be the best option for your scenario. You might also like to explore libraries like https://github.com/Yelp/mrjob which make it easier to write MR jobs in Python.

Recursive Algorithm on Distributed Systems

Are there any generic systems/frameworks that allow running recursive algorithms on distributed systems? Just as Hadoop can be used for batch processing, I am looking for a framework that makes it possible to write recursive functions that can be executed across multiple machines.
I have already seen 1. It's just out of curiosity that I am asking this.
Fork/Join should do it. Although the Java 7 implementation is the best known, you can also apply the same pattern to a distributed system. Look here for a comparison with map-reduce.
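To make the pattern concrete, here is a minimal, single-JVM Java sketch of a recursive Fork/Join task (summing an array by splitting it in half); running the same divide-and-conquer shape across machines would need a distributed framework on top:

    import java.util.Arrays;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Recursive divide-and-conquer sum: split until the chunk is small, then add up directly.
    public class RecursiveSum extends RecursiveTask<Long> {
        private static final int THRESHOLD = 1_000;
        private final long[] data;
        private final int lo, hi;

        public RecursiveSum(long[] data, int lo, int hi) {
            this.data = data;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected Long compute() {
            if (hi - lo <= THRESHOLD) {
                long sum = 0;
                for (int i = lo; i < hi; i++) {
                    sum += data[i];
                }
                return sum;
            }
            int mid = (lo + hi) >>> 1;
            RecursiveSum left = new RecursiveSum(data, lo, mid);
            RecursiveSum right = new RecursiveSum(data, mid, hi);
            left.fork();                         // schedule the left half asynchronously
            long rightResult = right.compute();  // recurse into the right half on this thread
            return left.join() + rightResult;    // combine once the left half is done
        }

        public static void main(String[] args) {
            long[] data = new long[10_000_000];
            Arrays.fill(data, 1L);
            long total = new ForkJoinPool().invoke(new RecursiveSum(data, 0, data.length));
            System.out.println(total); // prints 10000000
        }
    }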

Does anyone find Cascading for Hadoop Map Reduce useful?

I've been trying Cascading, but I cannot see any advantage over the classic map reduce approach for writing jobs.
MapReduce jobs give me more freedom, and Cascading seems to put a lot of obstacles in the way.
It might do a good job of making simple things simple, but complex things... I find them extremely hard.
Is there something I'm missing? Is there an obvious advantage of Cascading over the classic approach?
In what scenarios should I choose Cascading over the classic approach? Is anyone using it and happy with it?
Keeping in mind I'm the author of Cascading...
My suggestion is to use Pig or Hive if they make sense for your problem, Pig especially.
But if you are in the business of data, and not just poking around your data for insights, you will find the Cascading approach makes much more sense for most problems than raw MapReduce.
Your first obstacle with raw MapReduce will be thinking in MapReduce. Trivial problems are simple in MapReduce, but it's much easier to develop complex applications if you can work with a model that more easily maps to your problem domain (filter this, parse that, sort those, join the rest, etc.).
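To give a feel for that model, here is a hypothetical word-count flow sketched against the Cascading 2.x API (package names differ slightly in 1.x); the input and output paths are placeholders:

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCount {
      public static void main(String[] args) {
        // source and sink taps: where the data comes from and where it goes
        Tap source = new Hfs(new TextLine(new Fields("offset", "line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1], SinkMode.REPLACE);

        Pipe assembly = new Pipe("wordcount");
        // "parse that": split every line into words
        assembly = new Each(assembly, new Fields("line"),
            new RegexSplitGenerator(new Fields("word"), "\\s+"));
        // "sort/group those": bring identical words together
        assembly = new GroupBy(assembly, new Fields("word"));
        // aggregate: count occurrences per word
        assembly = new Every(assembly, new Count(new Fields("count")));

        Properties properties = new Properties();
        AppProps.setApplicationJarClass(properties, WordCount.class);

        FlowConnector connector = new HadoopFlowConnector(properties);
        Flow flow = connector.connect("wordcount", source, sink, assembly);
        flow.complete(); // the planner turns the assembly into one or more MR jobs
      }
    }

The same assembly style extends to joins and multi-branch flows without ever writing a Mapper or Reducer class.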
Next you will realize that a normal unit of work in Hadoop consists of multiple MapReduce jobs. Chaining jobs together is a solvable problem, but it should not leak into your application-domain-level code; it should be hidden and transparent.
Further, you will find refactoring and creating re-usable code much harder if you have to continually move functions between mappers and reducers, or from a mapper into the previous reducer to get an optimization. Which leads to the issue of brittleness.
Cascading believes in failing as fast as possible. The planner attempts to resolve and satisfy the dependencies between all those field names before the Hadoop cluster is even engaged in work. This means 90%+ of all issues will be found before you wait hours for your job to find them during execution.
You can alleviate this in raw MapReduce code by creating domain objects like Person or Document, but many applications don't need all the fields downstream. Consider if you needed the average age of all males. You do not want to pay the IO penalty of passing a whole Person around the network when all you need is a binary gender and a numeric age.
With fail-fast semantics and lazy binding of sinks and sources, it becomes very easy to build frameworks on Cascading that themselves create Cascading flows (which become many Hadoop MapReduce jobs). A project I'm currently involved with ends up with hundreds of MapReduce jobs per run, many created on the fly mid-run based on feedback from the data being processed. Search for Cascalog to see an example of a Clojure-based framework for simply creating complex processes. Or Bixo for a web-mining toolkit and framework that's far easier to customize than Nutch.
Finally, Hadoop is never used alone; that means your data is always pulled from some external source and pushed to another after processing. The dirty secret about Hadoop is that it is a very effective ETL framework (so it's silly to hear ETL vendors talk about using their tools to push/pull data onto/from Hadoop). Cascading eases this pain somewhat by allowing you to write your operations, applications, and unit tests independent of the integration end-points. Cascading is used in production to load systems like Membase, Memcached, Aster Data, Elastic Search, HBase, Hypertable, Cassandra, etc. (Unfortunately not all the adapters have been released by their authors.)
If you will, please send me a list of the issues you are experiencing with the interface. I am constantly looking for better ways to improve the API and documentation, and the user community is always around to help.
I've been using Cascading for a couple of years now. I find it to be extremely helpful. Ultimately, it's about productivity gains. I can be much more efficient in creating and maintaining M/R jobs as compared to plain Java code. Here are a few reasons why:
A lot of the boilerplate code used to start a job is already written for you.
Composability. Generally code is easier to read and easier to reuse when it is written as components (operations) which are stitched together to perform some more complex processing.
I find unit testing to be easier. There are examples in the cascading package demonstrating how to write simple unit tests to directly test the output of flows.
The Tap (source and sink) paradigm makes it easy to change the input and output of a job, so you can, for example, start with output to STDOUT for development and debugging, then switch to HDFS SequenceFiles for batch jobs, and then switch to an HBase tap for pseudo-real-time updates (see the sketch after this list).
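As a hypothetical sketch of that last point, the sink can be chosen at run time while the pipe assembly stays untouched (field names and paths here are invented; an HBase tap would slot in the same way):

    import cascading.scheme.hadoop.SequenceFile;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class Sinks {
      // Pick the sink at run time; the pipe assembly never changes.
      public static Tap wordCountSink(boolean production, String path) {
        if (production) {
          // compact binary SequenceFiles for batch runs
          return new Hfs(new SequenceFile(new Fields("word", "count")), path, SinkMode.REPLACE);
        }
        // plain text while developing and debugging, easy to eyeball
        return new Hfs(new TextLine(), path, SinkMode.REPLACE);
      }
    }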
Another great advantage of writing Cascading jobs is that you're really writing more of a factory that creates jobs. This can be a huge advantage when you need to build something dynamically (i.e. the results of one job control what subsequent jobs you create and run). Or, in another case, I needed to create a job for each combination of 6 binary variables. This is 64 jobs, which are all very similar. This would be a hassle with plain Hadoop MapReduce classes.
While there are a lot of pre-built components that you can compose together, if a particular section of your processing logic seems like it would be easier to just write in straight Java, you can always create a Cascading function to wrap that. This allows you to have the benefits of Cascading, but very custom operations can be written as straight java functions (implementing a Cascading interface).
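For example, here is a hypothetical sketch of wrapping ordinary Java string logic as a Cascading Function (the class and field names are invented):

    import cascading.flow.FlowProcess;
    import cascading.operation.BaseOperation;
    import cascading.operation.Function;
    import cascading.operation.FunctionCall;
    import cascading.tuple.Fields;
    import cascading.tuple.Tuple;

    // Wraps plain Java logic (trim + lowercase) as a Cascading Function.
    public class NormalizeWord extends BaseOperation<Void> implements Function<Void> {

      public NormalizeWord() {
        super(1, new Fields("normalized")); // expects one argument field, declares one result field
      }

      @Override
      public void operate(FlowProcess flowProcess, FunctionCall<Void> functionCall) {
        String word = functionCall.getArguments().getString(0);
        String normalized = word.trim().toLowerCase(); // ordinary Java, nothing Hadoop-specific
        functionCall.getOutputCollector().add(new Tuple(normalized));
      }
    }

It would then be applied inside an assembly with something like new Each(pipe, new Fields("word"), new NormalizeWord(), Fields.RESULTS).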
I used Cascading with Bixo to write the complete anti-spam link classification pipeline for a large social network.
The Cascading pipeline resulted in 27 MR jobs, which would have been very difficult to maintain in plain MR. I have written MR jobs before, but using something like Cascading feels like switching from Assembler to Java (insert_fav_language_here).
One of the big advantages over Hive or Pig IMHO is that Cascading is a single jar, which you bundle with your job. Pig and Hive have more dependencies (e.g. MySQL) or are not as easy to embed.
Disclaimer: While I know Chris Wensel personally, I really think Cascading is kick a**. Considering its complexity it is extremely impressive that I haven't found a single bug using it.
I teach the Hadoop Boot Camp course for Scale Unlimited, and also make extensive use of Cascading in Bixo and for building web mining apps at Bixo Labs - so I think I've got a good appreciation for both approaches.
The biggest single advantage I see in Cascading is that it allows you to think about your data processing workflow in terms of operations on fields, and to (mostly) avoid worrying about how to transpose this view of the world onto the key/value model that's intrinsically part of any map-reduce implementation.
The biggest challenge with Cascading is that it is a different way of thinking about data processing workflows, and there's a corresponding conceptual "hump" you need to get over before it all starts making sense. Plus the error messages can remind one of the output from lex/yacc ("conflict in shift/reduce") :)
-- Ken
I think the places where Cascading's advantages begin to show are instances where you have a pile of simple functions that should all be kept separate in source code but can all be collected into a composition in your mapper or reducer. Putting them together makes your basic map-reduce code hard to read; separating them makes the program really slow. Cascading's optimizer can put them together even though you write them separately. Pig, and to some extent Hive, can do this as well, but for large programs, I think Cascading has a maintainability advantage.
In a few months Plume may be an expressivity competitor, but if you have real programs to write and run in a production setting, then Cascading is probably your best bet.
Cascading allows you to use simple field names and tuples in place of the primitive types offered by Hadoop, which "... tend to be at the wrong level of granularity for creating sophisticated, highly composable code that can be shared among different developers" (Tom White, Hadoop: The Definitive Guide). Cascading was designed to solve those problems. Keep in mind that some of these applications, like Cascading, Hive, Pig, etc., were developed in parallel and sometimes do the same thing. If you don't like Cascading or find it confusing, maybe you would be better off using something else?
I'm sure you already have this, but here is the user guide: http://www.cascading.org/1.1/userguide/pdf/userguide.pdf. It provides a decent walk through of the flow of data in a typical Cascading application.
I worked with Cascading for a couple of years, and below are the things I found useful about it:
1. code testability
2. easy integration with other tools
3. easily extensible
4. you focus only on business logic, not on keys and values
5. proven in production and used even by Twitter.
I recommend people use Cascading most of the time.
Cascading is a wrapper around Hadoop that provides Taps and Sinks to and from Hadoop.
Writing Mappers and Reducers for all your tasks is going to be tedious. Try writing one Cascading job, and then you're all set to avoid writing any mappers and reducers.
You also want to look at Cascading Taps and Schemes (this is how you get data into your Cascading processing job).
With these two, i.e. the ability to avoid writing ad-hoc Hadoop mappers and reducers and the ability to consume a wide variety of data sources, you can solve a lot of your data processing quickly and effectively.
Cascading is more than just a simple wrapper around Hadoop; I am trying to keep the answer simple. For example, I've ported a huge MySQL database containing terabytes of data to log files using the Cascading JDBC tap.

Is it possible to perform arbitrary data analysis in Erlang?

I want to answer questions about data in Erlang: count things, correlate messages, provide arbitrary statistics. I had thought about resorting to Hadoop for this, but is it possible to build a solution in raw Erlang to do rather arbitrary data analysis, not necessarily via map/reduce, but somehow? I have seen some hints of people doing this, but no explicit blog posts or examples of it being done. I know that Powerset's natural language capabilities are written in Erlang. I also know about CouchDB, but was looking for some other solutions.
Yes.
For general-purpose computation and statistics, Erlang works just fine. It isn't heavily optimized for such work, so it will have trouble keeping up with similar numeric code in, say, MATLAB, Fortran, or any of the major C packages for this kind of work, but for most uses it will do just fine. And of course, if your code parallelizes neatly and you have multiple CPUs available, Erlang will catch up more easily.
(You also mentioned the map/reduce pattern; it is relatively trivial given the Erlang/OTP runtime and libraries.)
My colleagues and I have written plenty of "raw" Erlang to do counting, statistics, and so on. We have found it to be more than sufficient for most tasks.
