MR code from Hive

I am learning Hadoop and have recently been learning about Hive. It offers a SQL-like query language that can be used instead of writing MapReduce Java code. My tutor said that Hive produces MapReduce code behind the scenes, which is then packaged as a JAR file and run on top of the HDFS filesystem. But he was not sure whether we can get the MR code produced by Hive. I was reading the Definitive Guide, and it doesn't discuss this. I wanted to know: is it really so? I think that if we can get the MR code, it could be refined further to achieve other things (i.e., customized for other tasks).
I know I am just at the learning stage and should wait before asking or jumping to conclusions, but I was curious. If the concept above is correct, then capturing the MR code should be possible. I am a .NET C# developer, not familiar with Java and not an expert in relational databases; I mention this only in case it matters.
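For intuition about what Hive does behind the scenes, here is a rough, hand-written sketch of the MapReduce logic behind a simple aggregation such as `SELECT word, COUNT(*) FROM docs GROUP BY word`, expressed as Hadoop Streaming scripts in Python rather than Java. This illustrates the equivalent logic, not literal output you can extract from Hive, and the file names are hypothetical.

```python
#!/usr/bin/env python3
# mapper.py -- roughly the map phase of a GROUP BY/COUNT(*): emit (word, 1)
# for every word; Hadoop sorts the pairs by key before the reduce phase.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- roughly the reduce phase: input arrives sorted by key,
# so we can sum the counts for each consecutive run of identical words.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Such a job is launched with the hadoop-streaming JAR that ships with Hadoop, with something like `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /docs -output /counts` (the exact JAR path varies by installation).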

Related

Talend and Apache Spark?

I am confused as to where Talend and Apache Spark fit in the big data ecosystem, as both can be used for ETL.
Could someone please explain this with an example?
Talend is a tool-based approach to big data and supports big data applications through built-in components, whereas Spark is a code-based approach where you write the code for your use case.
In fact, Talend Big Data Studio generates Apache Spark code for the ETL jobs you design, so in essence they rely on the same engine.
Talend Studio provides built-in components, with Spark as the main engine behind them. The built-in components reduce coding time, whereas if you code directly in Spark with Scala, Java, or Python, you need time to build the common components yourself. Talend makes life easier and is easy to adopt for traditional ETL developers; for example, someone coming from Ab Initio can relate to the graph and lineage views Talend provides. But to extend a business component, people need to write Java code with Spark inside Talend Studio. One more thing: Talend takes care of packaging the JAR, deploying it from Windows to the server, running it, and displaying the result in its console.
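To make the code-based side of that contrast concrete, here is a minimal sketch of the kind of ETL job you would write by hand in Spark (PySpark here), and that Talend would instead generate from its components. The paths and column names are hypothetical.

```python
# A minimal hand-written Spark ETL job: extract, transform, load.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read a CSV file with a header row.
orders = spark.read.option("header", "true").csv("hdfs:///data/orders.csv")

# Transform: drop rows with no amount and cast the column to a number.
valid = (orders
         .filter(col("amount").isNotNull())
         .withColumn("amount", col("amount").cast("double")))

# Load: write the cleaned data out as Parquet.
valid.write.mode("overwrite").parquet("hdfs:///warehouse/orders_clean")

spark.stop()
```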

ETL vs Workflow Management: which to apply, and can they be used the same way?

I am in the process of setting up a data pipeline for a client. I've spent a number of years being on the analysis side of things but now I am working with a small shop that only really has a production environment.
The first thing we did was to create a replicated instance of production but I would like to apply a sort of data warehouse mentality to make the analysis portion easier.
My question comes down to: what tool should I use, and why? I have been looking at solutions like Talend for ETL but am also very interested in Airflow. The problem is that I'm not quite sure which suits my needs better. I would like to monitor and create jobs easily (I write Python fairly fluently, so Airflow job creation isn't an issue) but also be able to transform the data as it comes in.
Any suggestions are much appreciated
Please consider that the open-source version of Talend (Talend Open Studio) does not provide any monitoring or scheduling capabilities; it is only a code generator. The more sophisticated infrastructure is part of the enterprise editions.
For anyone who sees this: four years later, what we have done is leverage Airflow for scheduling, Fivetran and/or Stitch for extraction and loading, and dbt for transformations.
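As a concrete illustration of that division of labor, here is a minimal Airflow DAG sketch in which extraction and loading are assumed to be handled by an external service (e.g., Fivetran or Stitch) and the transformations run via dbt. The task commands, paths, and schedule are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_warehouse_refresh",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Stand-in for a sensor or API check that the nightly extract-and-load
    # run has finished; modeled here as a simple shell command.
    wait_for_load = BashOperator(
        task_id="wait_for_load",
        bash_command="echo 'check that the nightly load finished'",
    )

    # Run the dbt transformations once the raw data has landed.
    run_dbt = BashOperator(
        task_id="run_dbt",
        bash_command="dbt run --project-dir /opt/dbt/warehouse",
    )

    wait_for_load >> run_dbt
```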

Reverse engineering DataStage code into Pig (for Hadoop)

I have a landscape of DataStage applications that I want to reverse engineer into Pig, rather than having to write fresh Pig code and try to replicate the DataStage functionality.
Has anyone had experience of doing something similar?
Any tips on the best approach would be much appreciated.
What you want is a code migration from DataStage to Pig.
This can be done with a program transformation system; such systems are designed to parse, analyze, and transform complex software systems.
You can learn more about the issues of using such a tool at https://stackoverflow.com/a/3460977/120163

Learning Hadoop for System Admin

This is not a technical question, but I want suggestions from more experienced people regarding my career.
I have been working as a UNIX admin for the past 13 years, mostly on Solaris with a couple of years on Linux. Now I want to learn something more that can advance my career. I have been hearing a lot about Hadoop/Big Data for quite some time. I do not have any programming or scripting knowledge, nor any knowledge of Apache or databases.
- I am assuming that there are two different job profiles, Developer and Admin. Am I understanding that correctly?
- Do I need to learn Apache, databases, or Java to learn Hadoop (even for the Admin job profile)?
- Training is expensive where I live. If I want to start studying with books, which book should I start with? I can see the popular ones are "Hadoop: The Definitive Guide" (O'Reilly) and also "Big Data for Dummies". (I am asking at a beginner's level.)
Please help with my doubts; your suggestions will help me make a decision.
(Moved from a comment because it was too long.)
In order to administer Hadoop in any meaningful way you need to know a fair bit about (a) how Hadoop works, (b) how Hadoop runs its jobs, and (c) job-specific tuning.
I don't know what "learning Apache" means; Apache is a conglomerate of projects, unless you mean the web server itself.
"Learning databases" is too broad to be useful, and Hadoop isn't a database (HBase is).
You don't need any Java knowledge to administer a Java-based program, although knowing about JVM options, how to specify them, and JVM behavior in general is certainly helpful.
There is a lot to digest; I would start very small, e.g., with intro books. Also, keep in mind that there are other solutions besides Hadoop, and a lot of different ways to actually use Hadoop.
The Kiji project is a good way to get Hadoop/HBase/etc up and running, though if you're interested in doing everything "from scratch", it's not the best path.

How to start learning hadoop [closed]

I am a web developer with experience in web technologies like JavaScript, jQuery, PHP, and HTML, and I know the basic concepts of C. Recently I have taken an interest in learning more about MapReduce and Hadoop, so I enrolled in a parallel data processing with MapReduce course at my university. Since I don't have any prior programming experience in object-oriented languages like Java or C++, how should I go about learning MapReduce and Hadoop? I have started to read the Yahoo Hadoop tutorials and also O'Reilly's Hadoop: The Definitive Guide, 2nd edition.
I would like you to suggest ways I could go about learning MapReduce and Hadoop.
Here are some nice YouTube videos on MapReduce:
http://www.youtube.com/watch?v=yjPBkvYh-ss
http://www.youtube.com/watch?v=-vD6PUdf3Js
http://www.youtube.com/watch?v=5Eib_H_zCEY
http://www.youtube.com/watch?v=1ZDybXl212Q
http://www.youtube.com/watch?v=BT-piFBP4fE
Also, here are nice tutorials on how to set up Hadoop on Ubuntu:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
You can access Hadoop from many different languages, and a number of services will set Hadoop up for you. You could try Amazon's Elastic MapReduce (EMR), for instance, without having to go through the hassle of configuring the servers, workers, etc. This is a good way to get your head around MapReduce processing while deferring the issues of learning how to use HDFS well, how to manage your scheduler, and so on.
It's not hard to search for your favorite language and find Hadoop APIs for it, or at least some tutorials on linking it with Hadoop. For instance, here's a walkthrough of a PHP app run on Hadoop: http://www.lunchpauze.com/2007/10/writing-hadoop-mapreduce-program-in-php.html
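If you try the EMR route, launching a cluster can itself be scripted. Below is a hedged sketch using boto3 to run a one-off Hadoop Streaming step; the bucket names, scripts, region, and instance settings are all hypothetical, and the cluster terminates itself when the step finishes.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="wordcount-demo",
    ReleaseLabel="emr-6.10.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # Shut the cluster down once there are no more steps to run.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "streaming-wordcount",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://my-bucket/mapper.py,s3://my-bucket/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", "s3://my-bucket/input/",
                "-output", "s3://my-bucket/output/",
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```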
Answer 1:
It is very desirable to know Java. Hadoop is written in Java, and its popular SequenceFile format depends on Java.
Even if you use Hive or Pig, you'll probably need to write your own UDF someday. Some people try to write them in other languages, but I would guess that Java has the more robust, primary support for them.
Many Hadoop tools are not fully mature (like Sqoop, HCatalog, and so on), so you'll see many Java stack traces, and you'll probably want to hack on the source code someday.
Answer 2:
It is not required for you to know Java.
As others have said, it would be very helpful depending on how complex your processing may be. However, there is an incredible amount you can do with just Pig and, say, Hive.
I would agree that it is fairly likely you will eventually need to write a user-defined function (UDF); however, I've written those in Python, and it is very easy to write UDFs in Python (a short sketch follows after the source link below).
Granted, if you have very stringent performance requirements, then a Java-based MapReduce program would be the way to go. However, great advancements in performance are being made all the time in both Pig and Hive.
So, the short answer to your question is, "No", it is not required for you to know Java in order to perform Hadoop development.
Source: http://www.linkedin.com/groups/Is-it-must-Hadoop-Developer-988957.S.141072851
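As a minimal sketch of the Python UDF route mentioned in Answer 2 (using Pig's Jython support; the function name and schema are illustrative):

```python
# udfs.py -- an illustrative Python (Jython) UDF for Pig.
from pig_util import outputSchema

@outputSchema("normalized:chararray")
def normalize(value):
    # Trim whitespace and lowercase a string field; pass nulls through.
    if value is None:
        return None
    return value.strip().lower()
```

In a Pig script you would register it with something like `REGISTER 'udfs.py' USING jython AS myudfs;` and then call `myudfs.normalize(name)` inside a FOREACH ... GENERATE statement.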
1) Learn Java. No way around that, sorry.
2) Profit! It'll be very easy after that -- Hadoop is pretty darn simple.
It sounds like you are on the right track. I recommend setting up some virtual machines on your home computer so you can start taking what you see in the books and implementing it in your VMs. As with many things, the only way to get better at something is to practice it. Once you get into it, I am sure you will have enough knowledge to start a small project implementing Hadoop. Here are some examples of things people have built with Hadoop: Powered by Hadoop
Go through the Yahoo Hadoop tutorial before going through Hadoop: The Definitive Guide. The Yahoo tutorial gives you a very clean and easy understanding of the architecture.
I think the concepts are not arranged well in the book, which makes it a little difficult to study.
So do not study them together; go through the web tutorial first.
I just put together a paper on this topic. The resources above are great, but I think you'll find some additional pointers here: http://images.globalknowledge.com/wwwimages/whitepaperpdf/WP_CL_Learning_Hadoop.pdf
Feel free to join my blog about Big Data: https://oyermolenko.blog. I've been working with Hadoop for a couple of years, and in this blog I want to share my experience from the very start. I came from a .NET environment and faced a couple of challenges switching from one language to another. The blog is oriented toward people who haven't worked with Hadoop but have some basic technical background, like you. Step by step, I want to cover the whole family of Big Data services, describing the concepts and the common problems I met while working with them. I hope you will enjoy it.
