What technologies can an ETL developer learn? - business-intelligence

I am working as an ETL developer.
I am thinking of learning something new that is related to my experience.
I am not sure which one to choose.
Please suggest which technology would be good to learn for my future, such as big data, R, Python, etc.

I would suggest you learn R and Python, as they are common technologies in data applications. When you are comfortable with them, move on to Apache Spark for big data applications. Spark supports both R and Python, as well as Scala, which is another technology you could learn.
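To give a feel for how ETL work looks in Python, here is a minimal sketch using only the standard library. The sample data, the table name `orders`, and the function names are all made up for illustration. A real job would read from your actual sources and write to your actual warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a source file.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,alice,4.25
"""


def extract(text):
    """Read CSV text into a list of dicts (one per row)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with a missing amount and convert column types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].title(), float(row["amount"]))
        )
    return cleaned


def load(rows, conn):
    """Write the cleaned rows into a target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing is persisted
load(transform(extract(RAW_CSV)), conn)
total_by_customer = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(total_by_customer)
```

The same extract/transform/load split carries over when you later move to Spark. Only the engine behind each step changes.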

There are multiple tools: SAP Data Services, Informatica, Pentaho Data Integration, etc. Maybe you should find out which of them is used in your organization.


Reverse engineering DataStage code into Pig (for Hadoop)

I have a landscape of DataStage applications that I want to reverse engineer into Pig, rather than having to write fresh Pig code and try to replicate the DataStage functionality.
Has anyone had experience of doing something similar?
Any tips on the best approach would be much appreciated.
What you want is a code migration from DataStage to Pig.
This can be done with a program transformation system, a kind of tool designed to parse, analyze, and transform complex software systems.
You can learn more about the issues of using such a tool at https://stackoverflow.com/a/3460977/120163

Learning Oracle and GeoSpatial Systems

Lately, I have become more engrossed in learning Oracle and geospatial systems. I feel that mapping systems, combined with solid data structures, are two technologies that are carving out a niche in today's market.
If you are starting to learn about these technologies, where would you recommend starting off? If I understand correctly, the best way to learn them would be through actual work (or hobby), but I can't seem to find good places to get the resources to do so.
I would appreciate any advice, tips, resources and information everyone could provide to jump-start my learning and understanding of these technologies.
I saw a nice PDF about this, but for a hobbyist wanting to learn, are there free tools to start with?
You appear to be interested in OLAP/BI combined with GIS/mapping.
See information on Spatial OLAP (aka SOLAP) at http://www.spatialbi.org/ , as well as this list of tools at http://spatialolap.scg.ulaval.ca/DevApproaches.asp
Also, see GeoKettle at http://www.spatialytics.org/

How to start learning Hadoop [closed]

I am a web developer. I have experience in web technologies like JavaScript, jQuery, PHP, and HTML. I know the basic concepts of C. Recently I have taken an interest in learning more about MapReduce and Hadoop, so I enrolled in a course on parallel data processing with MapReduce at my university. Since I don't have any prior programming knowledge in object-oriented languages like Java or C++, how should I go about learning MapReduce and Hadoop? I have started to read the Yahoo Hadoop tutorials and also O'Reilly's Hadoop: The Definitive Guide, 2nd Edition.
I would like you to suggest ways I could go about learning MapReduce and Hadoop.
Here are some nice YouTube videos on MapReduce
Also, here are nice tutorials on how to setup Hadoop on Ubuntu
You can access Hadoop from many different languages and a number of resources set up Hadoop for you. You could try Amazon's Elastic MapReduce (EMR), for instance, without having to go through the hassle of configuring the servers, workers, etc. This is a good way to get your head around MapReduce processing while delaying a bit the issues of learning how to use HDFS well, how to manage your scheduler, etc.
It's not hard to search for your favorite language & find Hadoop APIs for it or at least some tutorials on linking it with Hadoop. For instance, here's a walkthrough on a PHP app run on Hadoop: http://www.lunchpauze.com/2007/10/writing-hadoop-mapreduce-program-in-php.html
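If it helps to see the MapReduce idea before touching a cluster, here is a small sketch in plain Python that imitates the three stages (map, shuffle, reduce) in a single process. It does not use Hadoop at all. The function names are my own, and a real job would have the framework do the shuffling across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["Hadoop runs MapReduce jobs", "MapReduce jobs scale out"])
print(counts)
```

Once this makes sense, the mapper and reducer you write for Hadoop are essentially the same two functions, just reading from and writing to the framework instead of Python lists.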
Answer 1:
It is very desirable to know Java. Hadoop is written in Java. Its popular SequenceFile format depends on Java.
Even if you use Hive or Pig, you'll probably need to write your own UDF someday. Some people still try to write them in other languages, but I guess that Java has more robust and primary support for them.
Many Hadoop tools (like Sqoop, HCatalog and so on) are not mature enough, so you'll see many Java error stack traces, and you'll probably want to hack the source code someday.
Answer 2:
It is not required for you to know Java.
As the others said, it would be very helpful, depending on how complex your processing may be. However, there is an incredible amount you can do with just Pig and, say, Hive.
I would agree that it is fairly likely you will eventually need to write a user-defined function (UDF). However, I've written those in Python, and it is very easy to do.
Granted, if you have very stringent performance requirements, then a Java based MapReduce program would be the way to go. However, great advancements in performance are being made all of the time in both Pig and Hive.
So, the short answer to your question is, "No", it is not required for you to know Java in order to perform Hadoop development.
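For illustration, here is roughly what a small Python UDF for Pig can look like. Treat the details as assumptions to verify against the Pig documentation: as far as I recall, Pig supplies an `outputSchema` decorator (importable from `pig_util` for streaming Python UDFs), so the sketch falls back to a stand-in when that module is not available. The function `extract_domain` and its schema string are made up for the example.

```python
try:
    # Assumption: available when the script runs as a Pig streaming Python UDF.
    from pig_util import outputSchema
except ImportError:
    # Stand-in so the function can be imported and tested outside Pig.
    def outputSchema(schema):
        def decorator(func):
            func.output_schema = schema
            return func
        return decorator


@outputSchema("domain:chararray")
def extract_domain(email):
    """Return the lower-cased domain of an e-mail address, or None if malformed."""
    if email is None or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].lower()


print(extract_domain("Ann@Example.COM"))
```

The point is that the UDF body is ordinary Python. You can unit test it on your laptop before registering it in a Pig script.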
Source :
1) Learn Java. No way around that, sorry.
2) Profit! It'll be very easy after that -- Hadoop is pretty darn simple.
It sounds like you are on the right track. I recommend setting up some virtual machines on your home computer so you can take what you see in the books and implement it in your VMs. As with many things, the only way to become better at something is to practice it. Once you get into it, I am sure you will have enough knowledge to start a small project to implement Hadoop with. Here are some examples of things people have built with Hadoop: Powered by Hadoop
Go through the Yahoo Hadoop tutorial before going through Hadoop: The Definitive Guide. The Yahoo tutorial gives you a very clean and easy understanding of the architecture.
I think the concepts are not arranged properly in the book, which makes it a little difficult to study.
So do not study them together. Go through the web tutorial first.
I just put together a paper on this topic. Great resources above, but I think you'll find some additional pointers here: http://images.globalknowledge.com/wwwimages/whitepaperpdf/WP_CL_Learning_Hadoop.pdf
Feel free to join my blog about Big Data - https://oyermolenko.blog. I’ve been working with Hadoop for a couple of years, and in this blog I want to share my experience from the very start. I came from a .NET environment and faced a couple of challenges related to switching from one language to another. My blog is aimed at people who haven't worked with Hadoop but have some basic technical background, like you. Step by step, I want to cover the whole family of Big Data services and describe the concepts and common problems I met working with them. Hope you will enjoy it.

Is Pentaho ETL and Data Analyzer good choice?

I was looking for an ETL tool and found a lot on Google about Pentaho Kettle.
I also need a data analyzer to run on a star schema, so that business users can play around and generate any kind of report or matrix. Again, Pentaho Analyzer looks good.
The other part of the application will be developed in Java, and the application should be database agnostic.
Is Pentaho good enough, or are there other tools I should check?
Pentaho seems to be pretty solid, offering the whole suite of BI tools, with improved integration reportedly on the way. But...the chances are that companies wanting to go the open source route for their BI solution are also most likely to end up using open source database technology...and in that sense "database agnostic" can easily be a double-edged sword. For instance, you can develop a cube in Microsoft's Analysis Services in the comfortable knowledge that whatever MDX/XMLA your cube sends to the database will be interpreted consistently, holding very little in the way of nasty surprises.
Compare that to the Pentaho stack, which will typically end up interacting with PostgreSQL or MySQL. I can't vouch for how PostgreSQL performs in the OLAP realm, but I do know from experience that MySQL - for all its undoubted strengths - has "issues" with the types of SQL that typically crop up all over the place in an OLAP solution (you can't get far in a cube without using GROUP BY or COUNT DISTINCT). So part of what you save in licence costs will almost certainly be used to solve issues arising from the fact that Pentaho doesn't always know which database it is talking to - robbing Peter to (at least partially) pay Paul, so to speak.
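To make the GROUP BY / COUNT DISTINCT point concrete, here is a sketch of the kind of query an OLAP layer ends up generating against a star schema. It runs on SQLite through Python's standard library purely for convenience. The tables `fact_sales` and `dim_product` and their contents are invented for the example, and it says nothing about how MySQL or PostgreSQL would perform on the same query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER,
    customer_id INTEGER,
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'books'), (3, 'games');
INSERT INTO fact_sales VALUES
    (1, 1, 100, 12.0),
    (2, 2, 100, 8.0),
    (3, 2, 101, 8.0),
    (4, 3, 102, 30.0);
""")

# Typical cube-style aggregate: join fact to dimension, group by a
# dimension attribute, and count distinct customers per group.
rows = conn.execute("""
    SELECT p.category,
           COUNT(DISTINCT f.customer_id) AS customers,
           SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(rows)
```

Running a handful of queries like this against your candidate database, at realistic data volumes, is a cheap way to find out whether the "database agnostic" promise holds for your case.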
Unfortunately, more info is needed. For example:
Will you need to exchange data with well-known apps (Oracle Financials, Remedy, etc)? If so, you can save a ton of time & money with an ETL solution that has support for that interface already built-in.
What database products (and versions) and file types do you need to talk to?
Do you need to support querying of web-services?
Do you need near real-time trickling of data?
Do you need rule-level auditing & counts for accounting for every single row?
Do you need delta processing?
What kinds of machines do you need this to run on? Linux? Windows? Mainframe?
What kind of version control, testing and build processes will this tool have to comply with?
What kind of performance & scalability do you need?
Do you mind if the database ends up driving the transformations?
Do you need this to run in userspace?
Do you need to run parts of it on various networks disconnected from the rest? (not uncommon for extract processes)
How many interfaces and of what complexity do you need to support?
You can spend a lot of time deploying and learning an ETL tool - only to discover that it really doesn't meet your needs very well. You're best off taking a couple of hours to figure that out first.
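Two of the questions above, delta processing and row-level counts, are easier to judge with a concrete picture. Here is a minimal sketch, assuming each row has a unique `id` key and that you can compare the previous snapshot with the current one. The hashing approach and all names are illustrative choices of mine, not how any particular ETL tool does it.

```python
import hashlib


def row_hash(row):
    """Fingerprint the non-key columns so changed rows can be detected cheaply."""
    payload = "|".join(str(row[col]) for col in sorted(row) if col != "id")
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compute_delta(previous, current):
    """Compare two snapshots (lists of dicts with an 'id' key) and classify rows."""
    prev = {row["id"]: row_hash(row) for row in previous}
    curr = {row["id"]: row_hash(row) for row in current}
    return {
        "insert": sorted(key for key in curr if key not in prev),
        "update": sorted(key for key in curr if key in prev and curr[key] != prev[key]),
        "delete": sorted(key for key in prev if key not in curr),
    }


# Hypothetical snapshots of the same source table on two consecutive loads.
previous = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bo", "city": "Rome"},
    {"id": 3, "name": "Cy", "city": "Lima"},
]
current = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bo", "city": "Milan"},
    {"id": 4, "name": "Di", "city": "Kyiv"},
]

delta = compute_delta(previous, current)
audit = {kind: len(ids) for kind, ids in delta.items()}  # counts for the audit log
print(delta, audit)
```

If a tool can't give you something equivalent to `delta` and `audit` out of the box, that tells you how much of this you would be building yourself.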
I've used Talend before with some success. You create your translation by chaining operations together in a graphical designer. There were definitely some WTF's and it was difficult to deal with multi-line records, but it worked well otherwise.
Talend also generates Java and you can access the ETL processes remotely. The tool is also free, although they provide enterprise training and support.
There are lots of choices. Look at BIRT, Talend and Pentaho, if you want free tools. If you want much more robustness, look at Tableau and BIRT Analytics.

Required language, tools and approach for a scalable web application like twitter

If you were to develop Twitter today, what language, tools, and approach would you take? How would you start from a very frugal configuration and gradually scale to the levels Twitter has reached today? Please provide direct responses like (PHP + Apache + memcached + MySQL) or (JSP + Tomcat/GlassFish + MySQL/other DB), etc.
The criterion is an architecture that scales easily without much engineering, and the right language, so that one doesn't need to rethink the decision once it is in place.
(As far as I know, Twitter is RoR, LinkedIn is Java and Digg is PHP. So I'm not looking for just random thoughts :) ) Please explain why you think your option should suffice.
As you already said, there are several applications that show that several technologies are able to scale. Fortunately for them.
I think you should not focus only on "is this technology the best for scaling", but on the two following points:
Do you have skills in that technology?
Is that technology suited (by its philosophy) to that application?
Scaling is one thing. But if you can't develop your application with the "killer" technology because you don't understand it, it's useless anyway.
I recommend looking at the High Scalability website. You can build a scalable web app in virtually any language, but it's not just a matter of using the right technology and then plugging it in. You have to know what you're doing, no matter what technology you use!
Twitter was developed using the Ruby on Rails (RoR) framework, and that seems to be a good choice. Ruby on Rails is database agnostic (it supports most databases), very scalable and very good for developing web applications quickly.
Cake is a popular alternative for PHP. I haven't used Cake, but I hear it is very similar. The alternative to these open source options would be a full-blown enterprise environment like the Microsoft .NET Framework.
