Can PIG and HIVE be called separate programming models? - hadoop

This question might sound irritating, and it may not actually have anything to do with real programming. It's a spin-off of a small debate I had with a colleague of mine. He kept insisting that Hive and Pig can be called separate "programming models" because when you write MapReduce jobs in them, you don't really need to think in MapReduce - especially if you're programming in Hive. From the programmer's perspective, the MapReduce part is completely abstracted away; it's completely SQL-like.
But I kind of disagreed, because scripts written in these languages ultimately get converted into multiple MapReduce jobs. So they can be called higher-level languages for programming against the same model, and the term "programming model" should be looked at from the point of view of the underlying data waiting to be crunched, not the programmer.
What's your opinion?

I would define it as follows: Hive and Pig are at different abstraction levels. The closest analogy is SQL versus a query execution plan. Both solve the same problem of querying a database, but SQL is declarative and states what the result should be, while a query execution plan specifies the actions to achieve it. In this analogy, Hive is the SQL and Pig is the query execution plan.
We can say that Hive is declarative and sits at a higher level of abstraction than the imperative Pig, so logically a Hive query could be translated into Pig.

Related

Are there any real-time Pig use cases available?

Please provide some real-time Pig use cases; banking and healthcare examples would be of great help. I'm also curious whether Pig can be used as an ETL tool in the Hadoop world.
Pig is typically a batch-processing tool, so I'm not sure what you are referring to when you ask for "real-time Pig use cases".
As for ETL: basically anything that can extract, transform and load data can be used for ETL purposes, and Pig can do that. We're using it in batch workflows for ETL.
You can find a few proof-of-concept examples for understanding the usage of Pig at the link below:
http://ybhavesh.blogspot.in/
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
I can recommend the book entitled "Pig design patterns" by Pradeep Pasupuleti for some useful examples (with source code included)

Tutorial on performance analysis of Pig and Hive scripts

I am looking for good tutorials on doing performance analysis and improvement of Pig Latin scripts and Hive scripts.
I'm not aware of any such tutorial. The only good way, in my view, is to do it yourself, keeping your data and your use case in mind.
That said, you can make use of something like TPC-H to benchmark your queries and, based on the results, improve and optimize your Pig and Hive queries if you find performance bottlenecks. This will also help you figure out what Pig and Hive are not good at. You can also compare the two tools if you're unsure which one to go with for a particular task.
You can find more on this by visiting the links below:
Running TPC-H Benchmark on Pig ticket.
Running TPC-H Benchmark on Hive ticket.
And if you need all the details, you can visit the original papers on running TPC-H on Pig and Hive. These papers contain a great deal of information and you will definitely find them helpful during the process.
HTH
I'm not sure if it's what you're looking for, but Big Data University has some pretty good tutorials on Hive and Pig. Give it a shot. You'll need the IBM QuickStart VM; it's a huge download, but it's free and pretty good.
Link:
http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/
There are tutorials on the VM as well that are pretty good, but I think the ones at Big Data University are better.
In case it matters, I registered on both websites and haven't gotten any spam or anything.

My Database Design skills stink. Where to seek remedy?

I have a web site that's been progressively growing in both traffic and complexity of database design. I've always worked as a developer first and foremost, and I've never really been much of a DB administrator beyond what I need to do to get my code running. This needs to change: I need to improve efficiency on the database side of things.
To give a vague example, I'm looking for how to go about learning:
Optimising complex tables/relationships for performance/scaling
How to index efficiently. (At the moment I throw indexes on foreign keys, and that's about it)
General design principles for complex databases
Most of the resources I've found are either directed more towards the basics of SQL ("this is a SELECT query, a JOIN, etc") or focus primarily on performance issues outside the DB.
So, I know this is a little vague - but where should I look to make sure my database is designed in the most efficient and sound manner possible?
Learn about data modeling. Choosing the right data structure is always a crucial first step, for programming in general and databases in particular. Performance cannot be "bolted" on top of a bad data structure! The ERwin Methods Guide is probably not a bad way to start learning about data modeling.
Learn how DBMSes organize data at the physical level. This will help you immensely in understanding how to "shape" your data for performance and how to effectively leverage many of the performance mechanisms modern DBMSes put at your disposal. Use The Index, Luke! is an excellent tutorial on the topic.
Learn how to efficiently access the database and make sure you really understand the client API that will be called from your code. Different APIs have their own idiosyncrasies, but they all share some common themes, such as parameter binding, query preparation and fetching. Even if you are "shielded" by an ORM from ever having to, say, bind parameters manually, this is still taking place "under the covers" and understanding it raises your ability to write performant code.
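To make that concrete, here is a minimal JDBC sketch - assuming a PostgreSQL database and a hypothetical users table, since the question doesn't say what the schema looks like - showing statement preparation, parameter binding and fetching, the things an ORM is doing for you under the covers:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ClientApiExample {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders - substitute your own JDBC URL and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "app_user", "secret")) {

            // Prepare once; the statement (and its plan) can be reused with different parameters.
            String sql = "SELECT id, name FROM users WHERE created_at > ? AND status = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                // Bind parameters instead of concatenating strings:
                // safer (no SQL injection) and friendlier to the server's statement cache.
                ps.setTimestamp(1, java.sql.Timestamp.valueOf("2024-01-01 00:00:00"));
                ps.setString(2, "active");

                // Hint to the driver how many rows to fetch per round trip.
                ps.setFetchSize(500);

                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("id") + " " + rs.getString("name"));
                    }
                }
            }
        }
    }
}
```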
Measure, measure, measure. Modern information systems are immensely complex and even experts find themselves making incorrect assumptions, so don't rely on assumptions!
I would suggest some reading on performance tuning. It is very specialized depending on the database backend you use, but here are some books to consider:
SQL Server
http://www.amazon.com/Server-Query-Performance-Tuning-Distilled/dp/1590594215/ref=sr_1_2?s=books&ie=UTF8&qid=1334154710&sr=1-2
http://www.amazon.com/Performance-Tuning-Server-Dynamic-Management/dp/1906434476/ref=sr_1_12?s=books&ie=UTF8&qid=1334154710&sr=1-12
MySQL
http://www.amazon.com/High-Performance-MySQL-Optimization-ebook/dp/B0028N4W7Y/ref=sr_1_3?ie=UTF8&qid=1334154504&sr=8-3
Oracle
http://www.amazon.com/Oracle-Database-Release-Performance-Techniques/dp/0071780262/ref=sr_1_2?s=books&ie=UTF8&qid=1334154909&sr=1-2
General performance tuning
http://www.amazon.com/SQL-Performance-Tuning-Peter-Gulutzan/dp/0201791692/ref=sr_1_18?s=books&ie=UTF8&qid=1334154964&sr=1-18
First and foremost, I'd recommend learning how to use EXPLAIN and what its output means. Run it on your most common queries and study the output. Are the queries using sensible indexes? Are they using indexes at all? Queries that look very simple at a glance might end up being quite costly.
Next, I'd suggest finding your slowest queries. Postgres (for example) has a feature that allows you to log the SQL source for all queries that take longer than N seconds to run. Are they slow because they're unindexed, very complex, or operating on a huge amount of data?
Third, I'd look at the number of times a particular query is run. Are you using the database to store static data, and hitting a table over and over again to grab a record that never changes? You could probably cache the result somewhere.
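To make the EXPLAIN advice from the first point concrete, here is a hedged sketch - assuming PostgreSQL, its standard JDBC driver, and made-up orders/customers tables - that runs EXPLAIN ANALYZE on a query and prints the plan lines so you can see whether your indexes are actually being used:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details - point this at your own database.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "app_user", "secret");
             Statement st = conn.createStatement()) {

            // EXPLAIN ANALYZE actually executes the query and reports real timings,
            // so only run it against a query you are happy to execute.
            String sql = "EXPLAIN ANALYZE "
                       + "SELECT o.id, o.total FROM orders o "
                       + "JOIN customers c ON c.id = o.customer_id "
                       + "WHERE c.country = 'DE'";

            try (ResultSet rs = st.executeQuery(sql)) {
                // Each row of the result set is one line of the plan.
                // Look for 'Seq Scan' on large tables - often a sign of a missing index.
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
```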

Does anyone find Cascading for Hadoop MapReduce useful?

I've been trying Cascading, but I cannot see any advantage over the classic MapReduce approach to writing jobs.
MapReduce jobs give me more freedom, and Cascading seems to be putting up a lot of obstacles.
It might do a good job of making simple things simple, but complex things I find extremely hard.
Is there something I'm missing? Is there an obvious advantage of Cascading over the classic approach?
In what scenario should I choose Cascading over the classic approach? Is anyone using it and happy with it?
Keeping in mind I'm the author of Cascading...
My suggestion is to use Pig or Hive if they make sense for your problem, Pig especially.
But if you are in the business of data, and not just poking around your data for insights, you will find the Cascading approach makes much more sense for most problems than raw MapReduce.
Your first obstacle with raw MapReduce will be thinking in MapReduce. Trivial problems are simple in MapReduce, but it's much easier to develop complex applications if you can work with a model that more easily maps to your problem domain (filter this, parse that, sort those, join the rest, etc.).
Next you will realize that a normal unit of work in Hadoop consists of multiple MapReduce jobs. Chaining jobs together is a solvable problem, but it should not leak into your application domain-level code; it should be hidden and transparent.
Further, you will find refactoring and creating reusable code much harder if you have to continually move functions between mappers and reducers, or from a mapper into the previous reducer to get an optimization. Which leads to the issue of brittleness.
Cascading believes in failing as fast as possible. The planner attempts to resolve and satisfy dependencies between all those field names before the Hadoop cluster is even engaged in work. This means 90%+ of all issues will be found before you wait hours for your job to hit them during execution.
You can alleviate this in raw MapReduce code by creating domain objects like Person or Document, but many applications don't need all the fields downstream. Consider needing the average age of all males: you do not want to pay the I/O penalty of passing a whole Person around the network when all you need is a binary gender and a numeric age.
With fail-fast semantics and lazy binding of sinks and sources, it becomes very easy to build frameworks on Cascading that themselves create Cascading flows (which become many Hadoop MapReduce jobs). A project I'm currently involved with ends up with hundreds of MapReduce jobs per run, many created on the fly mid-run based on feedback from the data being processed. Search for Cascalog to see an example of a Clojure-based framework for simply creating complex processes, or Bixo for a web mining toolkit and framework that's far easier to customize than Nutch.
Finally, Hadoop is never used alone; your data is always pulled from some external source and pushed to another after processing. The dirty secret about Hadoop is that it is a very effective ETL framework (so it's silly to hear ETL vendors talk about using their tools to push/pull data onto/from Hadoop). Cascading eases this pain somewhat by allowing you to write your operations, applications, and unit tests independent of the integration end-points. Cascading is used in production to load systems like Membase, Memcached, Aster Data, Elastic Search, HBase, Hypertable, Cassandra, etc. (Unfortunately not all the adapters have been released by their authors.)
If you will, please send me a list of the issues you are experiencing with the interface. I am constantly looking for better ways to improve the API and documentation, and the user community is always around to help.
I've been using Cascading for a couple of years now and I find it to be extremely helpful. Ultimately, it's about productivity gains: I can be much more efficient in creating and maintaining MapReduce jobs compared to plain Java code. Here are a few reasons why:
A lot of the boilerplate code used to start a job is already written for you.
Composability. Generally code is easier to read and easier to reuse when it is written as components (operations) which are stitched together to perform some more complex processing.
I find unit testing to be easier. There are examples in the cascading package demonstrating how to write simple unit tests to directly test the output of flows.
The Tap (source and sink) paradigm makes it easy to change the input and output of a job, so you can, for example, start with output to STDOUT for development and debugging, then switch to HDFS SequenceFiles for batch jobs, and then switch to an HBase tap for pseudo-real-time updates.
Another great advantage of writing Cascading jobs is that you're really writing more of a factory that creates jobs. This can be a huge advantage when you need to build something dynamically (i.e. the results of one job control what subsequent jobs you create and run). In another case, I needed to create a job for each combination of 6 binary variables; that's 64 jobs, all very similar. This would be a hassle with plain Hadoop MapReduce classes.
While there are a lot of pre-built components that you can compose together, if a particular section of your processing logic seems like it would be easier to just write in straight Java, you can always create a Cascading function to wrap it. This gives you the benefits of Cascading, while very custom operations can be written as plain Java functions (implementing a Cascading interface).
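As a rough illustration of that last point (wrapping plain Java logic in a Cascading operation), a custom Function looks roughly like the sketch below. The field names are made up and the exact package layout differs a little between Cascading versions, so treat it as a sketch rather than copy-paste code:

```java
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

/**
 * Wraps ordinary Java string handling as a Cascading Function:
 * takes one argument field and emits an upper-cased "name_upper" field.
 */
public class UppercaseFunction extends BaseOperation implements Function {

    public UppercaseFunction() {
        // one argument expected, one field declared on the output
        super(1, new Fields("name_upper"));
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        TupleEntry arguments = functionCall.getArguments();
        String name = arguments.getString(0);

        // Plain Java logic goes here - anything you'd rather not express
        // with the pre-built Cascading operations.
        String result = name == null ? null : name.toUpperCase();

        functionCall.getOutputCollector().add(new Tuple(result));
    }
}
```

It would then be dropped into a pipe assembly with something like new Each(pipe, new Fields("name"), new UppercaseFunction(), Fields.ALL), where "name" is whatever your incoming field is called.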
I used Cascading with Bixo to write the complete anti-spam link classification pipeline for a large social network.
The Cascading pipeline resulted in 27 MR jobs, which would have been very difficult to maintain in plain MR. I have written MR jobs before, but using something like Cascading feels like switching from Assembler to Java (insert_fav_language_here).
One of the big advantages over Hive or Pig IMHO is that Cascading is a single jar, which you bundle with your job. Pig and Hive have more dependencies (e.g. MySQL) or are not as easy to embed.
Disclaimer: While I know Chris Wensel personally, I really think Cascading is kick a**. Considering its complexity it is extremely impressive that I haven't found a single bug using it.
I teach the Hadoop Boot Camp course for Scale Unlimited, and also make extensive use of Cascading in Bixo and for building web mining apps at Bixo Labs - so I think I've got a good appreciation for both approaches.
The biggest single advantage I see in Cascading is that it allows you to think about your data processing workflow in terms of operations on fields, and to (mostly) avoid worrying about how to transpose this view of the world onto the key/value model that's intrinsically part of any map-reduce implementation.
The biggest challenge with Cascading is that it is a different way of thinking about data processing workflows, and there's a corresponding conceptual "hump" you need to get over before it all starts making sense. Plus the error messages can remind one of the output from lex/yacc ("conflict in shift/reduce") :)
-- Ken
I think Cascading's advantages begin to show in instances where you have a pile of simple functions that should all be kept separate in source code but can all be collected into a composition in your mapper or reducer. Putting them together makes your basic MapReduce code hard to read; separating them makes the program really slow. Cascading's optimizer can put them together even though you write them separately. Pig, and to some extent Hive, can do this as well, but for large programs I think Cascading has a maintainability advantage.
In a few months Plume may be an expressivity competitor, but if you have real programs to write and run in a production setting, then Cascading is probably your best bet.
Cascading allows you to use simple field names and tuples in place of the primitive types offered by Hadoop which, "... tend to be at the wrong level of granularity for creating sophisticated, highly composable code that can be shared among different developers" (Tom White, Hadoop: The Definitive Guide). Cascading was designed to solve those problems. Keep in mind that some of these tools, like Cascading, Hive and Pig, were developed in parallel and sometimes do the same thing. If you don't like Cascading or find it confusing, maybe you would be better off using something else?
I'm sure you already have this, but here is the user guide: http://www.cascading.org/1.1/userguide/pdf/userguide.pdf. It provides a decent walk through of the flow of data in a typical Cascading application.
I worked with Cascading for a couple of years, and below are the things I found useful about it:
1. Code testability
2. Easy integration with other tools
3. Easily extensible
4. You focus only on business logic, not on keys and values
5. Proven in production, and used even by Twitter
I recommend people use Cascading most of the time.
Cascading is a wrapper around Hadoop that provides Taps and Sinks into and out of Hadoop.
Writing mappers and reducers for all your tasks is going to be tedious. Try writing one Cascading job and you're all set to avoid writing any mappers and reducers.
You also want to look at Cascading Taps and Schemes (this is how you get data into your Cascading processing job).
With these two - the ability to avoid writing ad-hoc Hadoop mappers and reducers, and the ability to consume a wide variety of data sources - you can solve a lot of your data processing problems quickly and effectively.
Cascading is more than just a simple wrapper around Hadoop; I'm just trying to keep the answer simple. For example, I've ported a huge MySQL database containing terabytes of data to log files using the Cascading JDBC tap.
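To give a feel for the Tap/Scheme/pipe-assembly model these answers describe, here is a minimal word-count flow sketched against the Cascading 1.x-era API (the paths are placeholders and some class names moved in later releases, so treat it as illustrative rather than definitive):

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCountFlow {
    public static void main(String[] args) {
        // Taps describe where data lives; swapping Hfs for another tap
        // changes the endpoints without touching the pipe assembly below.
        Tap source = new Hfs(new TextLine(new Fields("line")), "input/path");
        Tap sink = new Hfs(new TextLine(), "output/path", true);

        // The pipe assembly: split lines into words, group by word, count.
        Pipe pipe = new Pipe("wordcount");
        pipe = new Each(pipe, new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"));
        pipe = new GroupBy(pipe, new Fields("word"));
        pipe = new Every(pipe, new Count(new Fields("count")));

        // The planner turns this assembly into one or more MapReduce jobs.
        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, WordCountFlow.class);
        Flow flow = new FlowConnector(properties).connect(source, sink, pipe);
        flow.complete();
    }
}
```

Swapping the Hfs sink for, say, an HBase or JDBC tap is exactly the kind of endpoint change the answers above are talking about; the pipe assembly in the middle stays the same.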

Is it stupid to write a large batch processing program entirely in PL/SQL?

I'm starting work on a program which is perhaps most naturally described as a batch of calculations on database tables, and will be executed once a month. All input is in Oracle database tables, and all output will be to Oracle database tables. The program should stay maintainable for many years to come.
It seems straightforward to implement this as a series of stored procedures, each performing a sensible transformation, for example distributing costs among departments according to some business rules. I can then write unit tests to check whether the output of each transformation is as I expected.
Is it a bad idea to do this all in PL/SQL? Would you rather do heavy batch calculations in a typical object oriented programming language, such as C#? Isn't it more expressive to use a database centric programming language such as PL/SQL?
You describe the following requirements:
a) Must be able to implement batch processing
b) Result must be maintainable
My response:
PL/SQL was designed to achieve just what you describe. It's also important to note that there are efficiencies in PL/SQL that are not available in other tools. A stored procedure language puts the processing next to the data, which is where batch processing ought to sit.
It is easy enough to write poorly maintainable code in any language.
Having said the above, your implementation will depend on the available skills, a proper design and adherence to good quality processes.
To be efficient, your implementation must process data in batches (select in batches and insert/update in batches). The danger with an OO approach is that it is easy to be led towards a design that processes data row by row. This type of approach contains unnecessary overhead and will be significantly less efficient than a design that processes data in batches of rows.
It is possible to use both approaches successfully.
Mathew Butler
Something for other commenters to note - the question is about PL/SQL, not about SQL. Some of the answers have obviously been about SQL, not PL/SQL. PL/SQL is a fully functional database language, and it's mature as well. There are some shortcomings, but for the type of thing the poster wants to do, it's very good.
No, it isn't necessarily a bad idea. If the solution seems straightforward to you and allows you to test and verify each process, it sounds like it could be a good idea. OO platforms can be (though they don't have to be) bad for large data sets, as object creation and overhead can kill performance.
Oracle designed PL/SQL with problems like yours in mind; if there is sufficient corporate knowledge of the database and PL/SQL, this seems like a reasonable solution. Keep large batch sets in mind, as each call from PL/SQL to the SQL engine is a context switch, so single-record operations should be batched together where possible to improve performance.
Just make sure you somehow log what is happening while it's working. Otherwise you'll have a black box and if it gets stuck somewhere for hours, you'll be wondering whether to stop it or let it work 'a little bit more'.
PL/SQL is a mature language that integrates well with SQL. With each version of Oracle it becomes more and more powerful.
Also, starting with Oracle 11g, PL/SQL can be compiled to native machine code.
Normally I say put as little in PL/SQL as possible - it is typically a lot less maintainable - at one of my last jobs I really saw how messy and hard to work with it could get.
However, since this is batch processing, and since the input and output are both in the DB, it makes good sense to put the logic into PL/SQL to minimize "moving parts". If it were business logic, or components used by other pieces of your system, I would say don't do it.
I wrote a huge number of batch processing and report generation programs in both PL/SQL and Pro*C for one project. They generally preferred that I write in PL/SQL, as their own developers, who would maintain it in the future, found it easier to understand than Pro*C code.
It ended up being only the really funky processing or reports that were written in Pro*C.
It is not necessary to write these as stored procedures, as other people have alluded to; they can just be script files that are run as necessary, kind of like shell scripts. That makes source code revision control and migration between test and production systems a heck of a lot easier, too.
As long as the calculations you need to perform can be adequately AND readably captured in PL/SQL, then using only PL/SQL would make the most sense.
The real catch is maintainability - it's very easy to write unmaintainable SQL, if only because every RDBMS has a different syntax and a different function set once you step outside simple SQL DML, and there are no real standards for formatting, commenting, etc.
I've created batch programs using C# and SQL.
Pros of C#:
You've got the full library of .NET and all the power of an OO language.
Cons of C#:
The batch program and the DB are separate - this means you'll have to manage your batch program separately from the database.
You need to escape all that dang SQL code.
Pros of SQL:
It integrates nicely with the DBMS. If this job only manipulates the database, it makes sense to include it with the database. You end up with a single DB and all of its components in one package.
No need to escape SQL code.
Keeping it real - you are programming in your problem domain.
Cons of SQL:
It's SQL, and I personally just don't know it as well as C#.
In general, I would stick with SQL because of the pros outlined above.
This is a loaded question :)
There are a couple of database programming architecture designs you should know of, along with their costs and benefits.
2 Tier generally means you have a client connecting to a DB, issuing direct SQL calls.
3 Tier generally means you have an "application server" that is issuing direct SQL calls to the DB, but the client is talking to the app server. Generally, this affords "scaling out".
Finally, you have 2 1/2 tiered apps that employ a 2 Tier like format, only the work is compartmentalized within stored procedures.
Your process sounds like a "back office" kind of thing, and clients/processes just need results that are being aggregated and cached on a once a month basis.
That is, there is no agent that connects often and says "do these calculations". Instead, you allude to a process that happens once in a while, so you can get away with non-real-time processing.
Therefore, given those requirements, I'd say that generally it will be faster to be closer to the data and let the database server do all the calculations.
I think you'll find that proximity to the data will serve you well.
However, in performing these calculations, you may find that some of them are not well suited to SQL. Take, for example, calculating the accrued interest of a bond or any fixed-income instrument: not very pretty in SQL, and much better suited to a richer programming language. However, if you just have simple averages and other relatively sane aggregates, I'd stick to stored procedures on the SQL side.
So again, there's not enough information about the nature of your calculations, what your shop mandates in terms of developers' SQL skills for support, or what your boss says... but since I know my way around SQL and like to stay close to the data, I'd stay with pure SQL/stored procedures for a task like this.
YMMV :)
It's not usually more expressive because most stored procedure languages suck by design. But it will probably run faster than in an external app.
I guess it boils down to how familiar you are with PL/SQL, how much time you have to write this, how important performance is, and whether you can reasonably expect maintainers to be familiar enough with PL/SQL to maintain a big program written in it.
If speed is not relevant and maintainers will probably not be proficient in PL/SQL, you might be better off using a 'traditional' language.
You could also use a hybrid approach, where you use PL/SQL to generate intermediate data (say, table joins and sums or whatever) and a separate application to control flow and check values and errors.
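As a sketch of that hybrid approach - assuming an Oracle database reachable over JDBC, and a hypothetical distribute_costs procedure and batch_errors table, since the question doesn't name its objects - the external controller can stay very thin while PL/SQL does the set-based work:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MonthlyBatchController {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details and object names.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "batch_user", "secret")) {
            conn.setAutoCommit(false);

            // Step 1: the heavy set-based work stays inside the database as PL/SQL.
            try (CallableStatement step = conn.prepareCall("{call distribute_costs(?)}")) {
                step.setInt(1, 202401); // period to process, e.g. YYYYMM
                step.execute();
            }

            // Step 2: the external controller only checks results and decides what happens next.
            try (PreparedStatement check = conn.prepareStatement(
                    "SELECT COUNT(*) FROM batch_errors WHERE period = ?")) {
                check.setInt(1, 202401);
                try (ResultSet rs = check.executeQuery()) {
                    rs.next();
                    if (rs.getLong(1) > 0) {
                        conn.rollback();
                        throw new IllegalStateException("Cost distribution reported errors; rolled back.");
                    }
                }
            }

            conn.commit();
        }
    }
}
```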

Resources