How is automation of ETL processing done? - etl

I am alloted 'Automation of ETL processing using Scripting Technologies' as my Engineering Final Year project. However I don't have any idea about exactly what is this project supposed to do. I konw the basic concepts of ETL. But can anyone help me about what is meant by automation of these processes.
I am not asking for the implemetation. Just an overview of what needs to be done

If you mean automation of initialization, it could be done by one of several things: You could kick off the process with the task scheduler, you could have a folder watcher that starts a process when it find a file in the folder it's watching, you could have a switch that watches an object's name an when the name is "on" it knows to run and rename the object to "off" when it's done, or it could be kicked off by a database trigger, etc. Or are we talking about using the scripts to automate the steps necessary to complete the ETL? Sounds like you need to talk to the prof.

ETL tools generally have schedulers
and OS's have schedulers
ie Unix with contrab
all ETL processes should be idempotent.

Related

ETL vs Workflow Management and which to apply? Can they be used the same?

I am in the process of setting up a data pipeline for a client. I've spent a number of years being on the analysis side of things but now I am working with a small shop that only really has a production environment.
The first thing we did was to create a replicated instance of production but I would like to apply a sort of data warehouse mentality to make the analysis portion easier.
My question comes down to what tool to use? Also, why? I have been looking at solutions like Talened for ETL but also am very interested in Airflow. The problem is that I'm not quite sure which suits my needs better. I would like to monitor and create jobs easily (I write python pretty fluently so Airflow job creation isn't an issue) but also be able to transform the data as it comes in.
Any suggestions are much appreciated
Please consider that the open source of talend (Talend Open Studio) does not provide any monitoring / scheduling capabilities. It is only "code generator". The more sophisticated infrastructure is part of the enterprise editions.
For anyone that sees this. Four years later and what we have done is leverage Airflow for scheduling, Fivetran and/or Sticher for extraction and loading, and dbt for transformations.

Which AWS product to use to run batch jobs?

I have a program written in C++11. On the current input it takes too long to run. Luckily, the data can be safely split into chunks for parallel processing, which makes it a good candidate for, say, a Map/Reduce service.
AWS EMR could be a possible solution. However, since my code uses many modern libraries, it's quite a pain to compile it on the instances that are assigned for Apache Hadoop clusters. For example, I want to use soci (not available at all), boost 1.58+ (1.53 is there), etc etc. I also need a modern C++ compiler.
Obviously, all libraries and compilers can be manually upgraded (and the process scripted), but this sounds like a lot of manual work. And what about slave nodes - will they get all the libraries? Somehow I'm not sure. And the whole process of initializing the environment can now take very long time - thus killing a lot of performance advantage that distributing the jobs was supposed to bring in to begin with.
On the other hand, I don't really need all the advanced functionality that Apache Hadoop provides. And I don't want to set up a personal permanent cluster with my own installation of Hadoop or similar, because I will need to run the tasks only periodically and most of the time the servers will be idle, wasting money.
So, what would be the best product (or overall strategy) that could do the following:
Grab the given binaries + set of input files
Run the binaries on a predefined number of instances, using a recent Linux, ideally Ubuntu 15.10
Put the resulting files in a predefined location (S3 bucket?)
Shut everything down
I am sure I could write a number of scripts using the aws tool to achieve that manually, but I really don't want to reinvent the wheel. Any thoughts?
Thanks in advance!
Honestly that would be pretty easy to script, and you'll need to probably use scripting to grab the latest code on the servers when they start up anyway. I would suggest looking into defining an AutoScaling group with scheduled scaling policies. Alternatively you could have a Lambda function scheduled to run and issue the API command to create your instances.
You could either have a startup script on the server AMI, or simply pass a user-data script when you create the instances, that pulls down the binaries and input files and runs the command. The final step of the script could be to copy results to S3 and shutdown the server.
The (relatively new) AWS Batch is made for this purpose specifically.

Quartz.NET vs Windows Scheduled Tasks. How different are they?

I'm looking for some comparison between Quartz.NET and Windows Scheduled Tasks?
How different are they? What are the pros and cons of each one? How do I choose which one to use?
TIA,
With Quartz.NET I could contrast some of the earlier points:
Code to write - You can express your intent in .NET language, write unit tests and debug the logic
Integration with event log, you have Common.Logging that allows to write even to db..
Robust and reliable too
Even richer API
It's mostly a question about what you need. Windows Scheduled tasks might give you all you need. But if you need clustering (distributed workers), fine-grained control over triggering or misfire handling rules, you might like to check what Quartz.NET has to offer on these areas.
Take the simplest that fills your requirements, but abstract enough to allow change.
My gut reaction would be to try and get the integral WinScheduler to work with your needs first before installing yet another scheduler - reasoning:
no installation required - installed and enabled by default
no code to write - jobs expressed as metadata
integration with event log etc.
robust and reliable - good enough for MSFT, Google etc.
reasonably rich API - create jobs, check status etc.
integrated with remote management tools
security integration - run jobs in different credentials
monitoring tooling
Then reach for Quartz if it doesn't meet your needs. Quartz certainly has many of these features too, but resist adding yet another service to own and manage if you can.
One important distinction, for me, that is not included in the other answers is what gets executed by the scheduler.
Windows Task Scheduler can only run executable programs and scripts. The code written for use within Quartz can directly interact with your project's .NET components.
With Task Scheduler, you'll have to write a shell executable or script. Inside of that shell, you can interact with your project's components. While writing this shell code is not a difficult process, you do have to consider deploying the extra files.
If you anticipate adding more scheduled tasks over the lifetime of the project, you may end up needing to create additional executable shells or script files, which requires updates to the deployment process. With Quartz, you don't need these files, which reduces the total effort needed to create and deploy additional tasks.
Unfortunately, Quartz.NET job assemblies can't be updated without restarting the process/host/service. That's a pretty big one for some folks (including myself).
It's entirely possible to build a framework for jobs running under Task Scheduler. MEF-based assemblies can be called by a single console app, with everything managed via a configuration UI. Here's a popular managed wrapper:
https://github.com/dahall/taskscheduler
https://www.nuget.org/packages/TaskScheduler
I did enjoy my brief time of working with Quart.NET, but the restart requirement was too big a problem to overcome. Marko has done a great job with it over the years, and he's always been helpful and responsive. Perhaps someday the project will get multiple AppDomain support, which would address this. (That said, it promises to be a lot of work. Kudos to he and his contributors if they decide to take it on.)
To paraphrase Marko, if you need:
Clustering (distributed workers)
Fine-grained control over triggering or misfire handling rules
...then Quartz.NET will be your requirement.

What is a "Job" (child process thing) in Windows, and when to use it?

Using Process Explorer (procexp.exe), especially with Google Chrome, child processes are called a Job. Same with Internet Explorer 8, but I noticed it first with Chrome.
What is a Job
What should I know about these things?
Why would (you|one) use them?
What scenarios should they be used?
What APIs are used.
I know the questions is a bit clumsy, please try and look past that. Thanks in advance.
I'm using WinXP by the way.
A Job under Process Explorer refers to Win32 Jobs. More information about this feature can be found here.
So,
1. What is a Job?
As above.
2. What should I know about these things?
If a job fails or becomes unstable, all processes it manages will become unstable or crash immediately.
3. Why would (you|one) use them?
They are interesting tools if my application/system fires up several processes. I can centralize certain tasks in one job and attach all processes to it. Like gracefully terminate all processes, manage their working sets, etc.
4. What scenarios should they be used?
Never did anything worthed using them. But as above. In applications or complex systems that fire up several processes. Under Chrome, for instance (since this is where you are seeing a job), it is quite possible the job is managing each process that is fired when you open a new tab.
5. What APIs are used?
The Win32 API
Ad 1/2. A job is a process with a job object assigned. They're used to manage groups of processes. One job object can have multiple process, but a process can be assigned to only one job object. You can also set several limits for the jobs, documented here.
Ad 5. CreateJobObject, AssignProcessToJobObject, SetInformationJobObject, TerminateJobObject, and few more, listed here.

Practical Alternative for Windows Scheduled Tasks (small shop)

I work in a very small shop (2 people), and since I started a few months back we have been relying on Windows Scheduled tasks. Finally, I've decided I've had enough grief with some of its inabilities such as
No logs that I can find except on a domain level (inaccessible to machine admins who aren't domain admins)
No alerting mechanism (e-mail, for one) when the job fails.
Once again, we are a small shop. I'm looking to do the analogous scheduling system upgrade than I'm doing with source control (VSS --> Subversion). I'm looking for suggestions of systems that
Are able to do the two things outlined above
Have been community-tested. I'd love to be a guinae pig for exciting software, but job scheduling is not my day job.
Ability to remotely manage jobs a plus
Free a plus. Cheap is okay, but I have very little interest in going through a full blown sales pitch with 7 power point presentations.
Built-in ability to run common tasks besides .EXE's a (minor) plus (run an assembly by name, run an Excel macro by name a plus, run a database stored procedure, etc.).
I think you can look at :
http://www.visualcron.com/
Consider Cygwin and its version of "cron". It meets requirements #1 thru 4 (though without a nice UI for #3.)
Apologize for kicking up the dust here on a very old thread. But I couldn't disagree more with what's been presented here.
Scheduled tasks in Windows are AWESOME (a %^#% load better than writing services I might add). Yes, not without limitations. But still extremely powerful. I rely on them in earnest for a variety of different things.
If you even have a slight grasp on c# you can write as custom "task" (essentially a console application) to do, well, virtually anything. If persistent/accessible logging is what you're after, why not something like Serilog or NLog? Even at the time of writing, it had a very robust feature set. This tool in and of itself, in conjunction with some c#, could've solved both your problems very easily.
Perhaps I'm missing the point, but it seems to me that this isn't really a problem. At least not anymore...
If you're looking for a free tool there is plenty of implementations for the popular Cron tool for Windows, for example CRONw. It's pretty easy to configure and maintain. You could easily write add custom WSH scripts to send your emails and add log entries.
If you're going commercial way BMC Control-M is arguably one of the best but I don't believe that it is particularly cheap.
You may also consider some upcoming packages like JobScheduler
Pretty old question, but we use Jenkins. Yes its main purpose is for CI\CD, but its also a really nice UI for CRON with a ton of plugins and integrations.

Resources