Talend and Apache Spark?

I am confused as to where Talend and Apache Spark fit in the big data ecosystem, as both Apache Spark and Talend can be used for ETL.
Could someone please explain this with an example?

Talend is a tool-based approach to big data and supports big data applications with built-in components, whereas Spark is a code-based approach where you write the code for your use case.

In fact, Talend Big Data Studio generates Apache Spark code for the designed ETL jobs, so in essence a Talend job ends up running as Spark code.

Talend Studio provides built-in components, and Spark is the main engine behind them. Because of the built-in components it reduces coding time, whereas coding directly in Spark with Scala, Java, or Python takes time to build those common components yourself. Talend makes life easier and is easy to adopt for traditional ETL developers; for example, someone coming from Ab Initio can relate to the graph or lineage view that Talend provides. But to extend a business component, people still need to write Java code with Spark inside Talend Studio. One more thing: Talend takes care of packaging the jar, deploying it from Windows to the server, running it, and displaying the result in its console.
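To make the contrast concrete, here is a minimal, hand-written PySpark ETL sketch of the kind of code that Talend's Spark components spare you from writing by hand (the file paths and column names are hypothetical):

# Minimal PySpark ETL sketch: extract, transform, load.
# File paths and column names are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hand_written_etl").getOrCreate()

# Extract: read raw CSV data
orders = spark.read.option("header", True).csv("/data/raw/orders.csv")

# Transform: filter, cast, and aggregate
daily_totals = (
    orders
    .filter(F.col("status") == "COMPLETED")
    .withColumn("amount", F.col("amount").cast("double"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount"))
)

# Load: write the result as Parquet
daily_totals.write.mode("overwrite").parquet("/data/curated/daily_totals")

spark.stop()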

Related

Can apache mahout ALS work without hadoop?

I tried using ParallelALSFactorizationJob, but it crashes here:
Exception in thread "main" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
The command-line help mentions using the filesystem, but it seems to want Hadoop. How can I run it on Windows? The mahout.cmd file is broken:
"===============DEPRECATION WARNING==============="
"This script is no longer supported for new drivers as of Mahout 0.10.0"
"Mahout's bash script is supported and if someone wants to contribute a fix for this"
"it would be appreciated."
So is that possible (ALS + Windows - hadoop)?
Mahout is a community-driven project and its community is very strong.
"Apache Mahout is one of the first and most prominent Big Data machine
learning platforms. It implements machine learning algorithms on top
of distributed processing platforms such as Hadoop and Spark."
-Tiwary, C. (2015). Learning Apache Mahout.
Apache Spark is an open-source, in-memory, general-purpose computing system that runs on both Windows and Unix-like systems. Unlike Hadoop's disk-based computation, Spark loads data into cluster memory, where it can be queried repeatedly.
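As a small, hedged illustration of that model (Spark in local mode, no Hadoop installation; the data is generated on the fly):

# Sketch: Spark in local mode, caching a DataFrame in memory
# and querying it repeatedly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("in_memory_demo").getOrCreate()

df = spark.range(0, 1_000_000)
df.cache()                                # keep the data in memory

print(df.count())                         # first action materialises the cache
print(df.filter("id % 2 = 0").count())    # later queries hit memory, not disk

spark.stop()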
"As Spark is gaining popularity among data scientists, the Mahout
community is also quickly working on making Mahout algorithms function
on Spark's execution engine to speed up its calculation 10 to 100
times faster. Mahout provides several important building blocks to
create recommendations using Spark."
-Gupta, A (2015). Learning Apache Mahout Classification.
(This last book also provides a step-by-step guide to using Mahout's Spark shell; they don't use Windows, and it isn't clear whether they use Hadoop or not, though. For more information on that topic, see the implementation section at https://mahout.apache.org/users/sparkbindings/play-with-shell.html.)
In addition to this, you can build recommendation engines using Spark features such as DataFrames, RDDs, Pipelines, and Transformers, all available in Spark MLlib, and
"in Spark, (...) the Alternating Least Squares (ALS) method is used for
generating model-based collaborative filtering."
-Gorakala, S. (2016). Building Recommendation Engines.
At this point, there's one question still to answer before answering yours: can we run Spark without Hadoop? Yes: Spark can run in local or standalone mode and only needs HDFS/YARN if you choose to use them.
So, yes, it's possible to use the ALS method on Windows using Spark (without Hadoop).
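As a rough sketch of that (toy ratings data, local mode, assuming Spark MLlib's ALS estimator):

# Sketch: ALS collaborative filtering with Spark MLlib in local mode,
# no Hadoop required. The ratings data is made up for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.master("local[*]").appName("als_demo").getOrCreate()

ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0),
     (1, 12, 3.0), (2, 11, 2.0), (2, 12, 5.0)],
    ["userId", "itemId", "rating"],
)

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=5, maxIter=10, coldStartStrategy="drop")
model = als.fit(ratings)

# Recommend the top 3 items for every user
model.recommendForAllUsers(3).show(truncate=False)

spark.stop()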

ETL vs. workflow management: which to apply? Can they be used interchangeably?

I am in the process of setting up a data pipeline for a client. I've spent a number of years being on the analysis side of things but now I am working with a small shop that only really has a production environment.
The first thing we did was to create a replicated instance of production but I would like to apply a sort of data warehouse mentality to make the analysis portion easier.
My question comes down to what tool to use, and why? I have been looking at solutions like Talend for ETL but am also very interested in Airflow. The problem is that I'm not quite sure which suits my needs better. I would like to monitor and create jobs easily (I write Python pretty fluently, so Airflow job creation isn't an issue) but also be able to transform the data as it comes in.
Any suggestions are much appreciated
Please consider that the open-source version of Talend (Talend Open Studio) does not provide any monitoring or scheduling capabilities; it is only a code generator. The more sophisticated infrastructure is part of the enterprise editions.
For anyone who sees this: four years later, what we have done is leverage Airflow for scheduling, Fivetran and/or Stitch for extraction and loading, and dbt for transformations.
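For the Airflow side, a minimal sketch of a daily extract-then-transform DAG (assuming Airflow 2.x; the task names and functions are placeholders):

# Minimal Airflow 2.x DAG sketch: extract then transform, scheduled daily.
# The task functions and names are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # e.g. pull data from the replicated production instance
    print("extracting...")


def transform():
    # e.g. reshape data into warehouse-friendly tables
    print("transforming...")


with DAG(
    dag_id="daily_warehouse_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task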

What technologies can an ETL developer learn?

I am working as an ETL developer.
I am thinking of learning something new related to my experience.
I am not sure which one to choose.
Please suggest which technology would be good to learn for my future, e.g. big data, R, Python, etc.
I would suggest you learn R and Python, as they are pretty common technologies in data applications. When you are comfortable with them, move to Apache Spark for big data applications; Spark supports both R and Python, as well as Scala, which can be another technology for you to learn.
There are multiple tools: SAP Data Services, Informatica, Pentaho Data Integration, etc. Maybe you should evaluate which of them is used in your organization.

Which ETL tool is most configurable?

I am looking for the ETL tool best suited to the following criteria:
Supports MongoDB
Accepts metadata as input (or accepts a file and builds its metadata on the fly)
Provides configurable mapping (the mapping can be defined outside development, using some file or table)
Please suggest the tool which caters to the above needs.
Hmm, your question is looking for the most configurable ETL tool. From years of experience with ETL processes, I can tell you that you will never find a tool that meets all your demands. Especially when you have an enterprise-level data warehouse (needed because of high and complex reporting needs), the only software solution is to build your own custom, project-based ETL software, which is often a thankless task.
But (big BUT), you can achieve at least 80% of your needs with existing tools. Plugins, smart use of scripts, good data-flow design, and (if needed) small custom software paired with scheduling can help you fulfill the imagined process. The ETL process is no different from any other work: 80% of the work is done in 20% of the time, and the remaining 20% of the work takes 80% of the time.
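For example, a rough Python sketch of the "mapping defined outside development" idea, with a hypothetical mapping.json file:

# Hypothetical sketch: a field mapping defined outside the code (in a JSON file)
# and applied to source records, so the mapping can change without redeployment.
import json

# mapping.json might look like: {"cust_nm": "customer_name", "amt": "amount"}
with open("mapping.json") as f:
    field_mapping = json.load(f)

def apply_mapping(record: dict) -> dict:
    # Rename source fields to target fields according to the external mapping.
    return {field_mapping.get(key, key): value for key, value in record.items()}

source_record = {"cust_nm": "Acme", "amt": 125.5}
print(apply_mapping(source_record))   # {'customer_name': 'Acme', 'amount': 125.5}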
My suggestion for you:
Pentaho Data Integration - free and open source
PDI is a powerful ETL tool and can surely meet your demands. There are plenty of plugins, a solid community, and a fine API if you're going to develop more plugins.
Pentaho Data Integration + Integration Server - Enterprise Edition - "cheap enough" for almost every medium-size project
The Enterprise Edition has everything the free edition has, plus more plugins (a JMS producer, for example), a version control system, Instaview, etc.
Besides, it has its own server, so scheduling is software-based (not OS-based), and you get logging, better management, and, most importantly, support!
Informatica or Microsoft SSIS - expensive and brilliant
I will not waste many words on these tools. Informatica is a primarily ETL-oriented company, and using Informatica at a high level requires a deep understanding of DB/DWH design, the ETL process, PL/SQL, dimensional modeling, etc.
SSIS is primarily built for SQL Server, so I don't see much use for it if at least one of your source or target databases (DWH) is not running on SQL Server.
Conclusion
This just scratches the surface of the many tools the market provides. Someone else would probably not even mention these tools. Please look at one of the published tool lists.
Almost every BI system has its own ETL tool. Maybe a good choice would be to use them together; that way you will be able to get the maximum out of both.
Note: a good ETL project manager or ETL developer can extend a tool's advantages to the level that better/more expensive tools offer!

MR code from Hive

I am learning Hadoop and have recently been learning about Hive. It is like a SQL query that can be used instead of MR Java code. My tutor said that Hive produces MR Java code behind the scenes, which in turn is exported as a jar file and then run on top of the HDFS filesystem. But he was not sure whether we can get the MR code produced by Hive. I was reading the Definitive Guide and nothing was discussed about it. I want to know: is it really so? I think that if we can get the MR code, then that code could be refined further to achieve many other things (i.e., it could be customized for other tasks).
I know I am just at the learning stage and should wait before asking or jumping to conclusions, but I was curious. I think that if the above concept is correct, then capturing the MR code should be possible. I am a .NET C# developer, not familiar with Java and not an expert in relational DBs; I mention this only in case it matters.
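One hedged way to look at what Hive actually generates for a query is to run EXPLAIN on it; the sketch below does that from Python via the PyHive client (host, port, and the query are placeholders), and what comes back is Hive's plan of MapReduce stages rather than Java source code:

# Sketch: inspect the plan Hive generates for a query via EXPLAIN,
# using the PyHive client. Host, port, and the query are placeholders.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

cursor.execute("EXPLAIN SELECT category, COUNT(*) FROM sales GROUP BY category")
for (line,) in cursor.fetchall():
    print(line)   # prints the stage plan (map/reduce stages, operators)

cursor.close()
conn.close()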
