Best language for data transformation (ETL)

What is the best end-user language for data transformation (ETL), ideally something that can run in a JVM (with an LGPL-like license)?
By end-user I mean users who are IT-related but might not be experts in OO or the like.

You can try CloverETL. It is a Java-based ETL tool. You can either define transformations through its visual designer or hand-code them directly in the XML files that store the transformation definitions.
It also has its own transformation language, CTL, which is used inside transformation components. You do not need to be an expert in OOP to use it.

Rust is the best language. In fact, companies can now evolve from ETL to STL, or Stream, Transform and Load, with a technology like InfinyOn Cloud. InfinyOn Cloud is a fully managed Fluvio service. This solution brief explains how the Financial Services industry is leveraging this tech:
https://www.infinyon.com/resources/real-time-data-trasformation/

Related

Should event driven architecture be targeted for all data & analytics platforms?

For example:
You have an IT estate where a mix of batch and real-time data sources exists across multiple systems, e.g. ERP, project management, assets, website, monitoring, etc.
The aim is to integrate the data sources into a (vendor-agnostic) cloud environment.
There is a need for reporting and analytics on combinations of all data sources.
Inevitably, some source systems are not capable of streaming, so batch loading is required.
There are potential use cases for performing functionality/changes/updates based on the ingested data.
Given a steer towards creating a future-proofed platform, how would you design it, architecturally?
It's a very open-ended question, but there are some good principles you can adopt to help point you in the right direction:
Avoid point-to-point integration, and get everything going through a few common points - ideally one. Using an API Gateway can be a good place to start; the big players (Azure, AWS, GCP) all have their own options, plus there are lots of decent independent ones like Tyk or Kong.
Batches and event-streams are totally different, but even then you can still potentially route them all through the gateway so that you get the centralised observability (reporting, analytics, alerting, etc).
Use standards-based API specifications where possible. A good REST-based API, built on a proper resource model, is a non-trivial undertaking, and may not fit if you are dealing with lots of disparate legacy integrations. If you do adopt REST, use OpenAPI to specify the APIs. Using this standard not only makes it easier for consumers, but also gives you better tooling, as many design, build and test tools support OpenAPI. There's also AsyncAPI for event/async APIs.
Do some architecture. Moving sh*t to the cloud doesn't remove the sh*t - it just moves it to the cloud. Don't recreate old problems in a new place.
Work out the logical components in your new solution: what does each of them do (what's its reason to exist)? Don't forget ancillary components like API catalogues, etc.
Think about layering the integration (usually depending on how the APIs will be consumed and what role they need to play, e.g. system interfaces, orchestration, experience APIs, etc.).
Want to handle data in a consistent way regardless of source (your 'agnostic' comment)? You'll need to think through how data is ingested and processed. This might lead you into more data / ETL centric considerations rather than integration ones.
Co-design. Is the integration mainly data coming in or going out? Is the integration with 3rd parties or strictly internal?
If you are designing for external / 3rd party consumers then a co-design process is advised, since you're essentially designing the API for them.
If the APIs are for internal use, consider designing them as if for external use, so that when/if you decide to open them up later it's not so hard.
Take a step back:
Continually ask yourselves "what problem are we trying to solve?". Usually, a technology initiative is successful if there's a well-understood reason for doing it, with solid buy-in from the business (non-IT).
Who wants the reporting, and why - what problem are they trying to solve?
As you mentioned, it's an IT estate - i.e. an enterprise-level solution with a mix of batch and real-time sources - so first you have to identify the end goal of this migration. You can think about refactoring applications. If you are trying to make it event-driven, then assess the refactoring effort and cost. Separation of responsibility is the key factor for refactoring and migration.
If you are thinking about future-proofing your solution, then consider the cloud for storing and processing your data. It won't necessarily be cheap, but a mix of cloud and on-prem could be a way forward. Cloud providers offer services to move your data at minimal cost, and cloud-native solutions exist for performing analysis on your data. A database migration service in AWS or Azure can move data and then capture ongoing changes, so you can keep using your on-prem DB and apps while performing analysis and reporting in the cloud. That will ease the load on your transactional DB. Most data sync from on-prem to cloud is near real time.

Why dataweave over template engines like Velocity/Freemarker/Thymeleaf

I see broad adoption of DataWeave, which I feel is more of a transformation library, just like FreeMarker or Velocity.
In the case of DW, a change in transformation logic requires a change in code - yet the very reason template engines got popular in the first place was to separate logic from code, so that transformation logic can change without rebuilding/repackaging the application (less deployment hassle).
Can anyone point out a few reasons why one would prefer DW?
TLDR: If you're looking for a template engine for things like static websites, DataWeave definitely isn't the right choice. Use the right tool for the job. Also, while you can use DataWeave outside of Mule, I don't think I've seen anyone adopt DataWeave who hasn't adopted MuleSoft.
A few things to consider (and most of these I'm stating in the context of developing Mule applications):
These template engines are, typically, for outputting static text. If you're using one to output structured data rather than something like an HTML page, you're probably doing it wrong. They aren't going to return structured data - they are going to return text. If you're at the very end of your flow and you're going to output that text back out of the API or to a file, you're fine, I suppose; but if you want to actually work with that output, you're going to have to convert the plain text into an actual object, introducing a lot of extra steps when you could have just used DataWeave in the first place. DataWeave is especially beneficial when you want to do things like streaming because you're processing large payloads: DataWeave can understand JSON, XML, and CSV (the three most common data types I see) in a streamed format without any additional work, making it very easy to create efficient applications. The big difference between a template engine and a data transformation language is that one is for outputting text using structured data as input, and the other is for working with structured data on the input and outputting structured data that you can continue to work with. There is a reason that almost all of the template engine docs talk about building websites and not things like integrations.
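To make that structured-in/structured-out point concrete, here is a minimal Java sketch (using Jackson purely as an illustration; the field names are made up, not from the original answer). The transformation produces an object you can keep working with, and text only appears at the very end:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;

    public class StructuredTransform {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();

            // Structured input, e.g. a payload from an upstream system
            JsonNode order = mapper.readTree("{\"orderId\": 42, \"amountCents\": 1999}");

            // Structured output: still an object, not a blob of text,
            // so downstream steps can keep working with it
            ObjectNode invoice = mapper.createObjectNode();
            invoice.put("id", order.get("orderId").asLong());
            invoice.put("amount", order.get("amountCents").asLong() / 100.0);

            // Serialization happens only at the very end, if at all
            System.out.println(mapper.writeValueAsString(invoice));
        }
    }

A template engine, by contrast, would hand you that final string straight away, and anything downstream would have to re-parse it.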
The DataWeave engine is, as Aled indicated, built into the Mule runtime. Deeply so. You can use DataWeave in any field in any connector by default, even fields that don't have the f(x) button - because it's built into the runtime. This makes DataWeave what you could consider a first-class citizen within Mule, unlike a template engine, which you would only be able to utilize via connectors or by invoking Java bridges/libraries - which you do via DataWeave or a long series of connector operations.
The benefits you listed are also not things you can't do with DataWeave. You can VERY easily templatize and externalize DataWeave - for example, I have several DataWeave libraries in my Maven repo that I can include as dependencies. I've built several transformation services that use databases with DataWeave to do transformation, allowing me to change those transformations without modifying the app. You can also use dynamic DataWeave, where you use a template system to load specific parts of the script before running it. I've even taken it a step further and written a generic DataWeave script that I can use to do basic mappings without writing DataWeave - this allowed me to wrap a web UI around things pretty easily.
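As a sketch of that externalization pattern - not MuleSoft's API; TransformEngine below is a hypothetical stand-in for whatever actually runs the script (e.g. the DataWeave runtime) - the key idea is that the mapping is loaded at runtime rather than compiled in:

    import java.nio.file.Files;
    import java.nio.file.Path;

    // Hypothetical stand-in for whatever executes the script
    // (e.g. the DataWeave runtime); illustrative, not a real API.
    interface TransformEngine {
        String run(String script, String payload);
    }

    public class ExternalizedTransform {
        private final TransformEngine engine;

        public ExternalizedTransform(TransformEngine engine) {
            this.engine = engine;
        }

        public String transform(Path scriptLocation, String payload) throws Exception {
            // The mapping lives outside the packaged application (file,
            // database, config service), so it can change without a
            // rebuild or redeploy.
            String script = Files.readString(scriptLocation);
            return engine.run(script, payload);
        }
    }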
I wouldn't use DataWeave outside of MuleSoft unless you're a MuleSoft shop. If you are a MuleSoft shop, using the CLI to run your scripts, the same way you do with most interpreted languages, works fairly nicely - especially since you likely already have in-house expertise in DataWeave. The language is still niche enough that unless you've already adopted it for use in Mule applications I don't see any advantage in using it.
Docs / basic examples:
https://github.com/mulesoft-labs/data-weave-native
https://docs.mulesoft.com/mule-runtime/4.3/parse-template-reference
https://docs.mulesoft.com/mule-runtime/4.3/dataweave-create-module
https://github.com/mikeacjones/transform-system-api
Because it is the expression and transformation language embedded in the Mule runtime. If you are using Mule, it is also integrated with the Anypoint Studio IDE.
Outside Mule applications I don't think you can use DataWeave easily, so you might want to go with the alternatives.

Data Transformations in Snowflake - View, Tools etc?

We're considering Snowflake and want to understand how we could use it, and possibly other tools, to overcome one of our main problems - ETL! We currently use a legacy DWH with an ETL process consisting of SSIS and some views. This has all the common pitfalls of this methodology - most notably that it takes ages!
I was under the assumption that we'd move to an ELT model in Snowflake, so I started to research tools to do the 'T' part of it. However, I'm just listening to this podcast: https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/
And it suggests that just slapping a SQL view over something and exposing it in, say, Power BI or Tableau is enough for the T part of things!
Just wondering what people's experience was here?
- Do you do transformations just by writing a view in Snowflake?
- Do you use a third party tool specifically to address this need?
Secondary to this, for the Extraction and Loading, do you:
- Do this using Snowflake only
- Use a third party tool
I'm specifically interested in whether you do this to create some kind of timeseries in Snowflake from a non-timeseries source. That's something we'd be keen to do.
This question is hard to answer without sounding opinionated, especially not knowing your use case. For what it's worth here is what I think:
Don't stick views on top of your tables and expose them to a reporting tool unless you have a very, very simple setup. If you're considering a tool like Snowflake then you will probably want something more sustainable; this approach can become prohibitive in terms of cost and the complexity of your views.
Use a third-party tool to manage your ELT process. Your choice of tool will depend on your internal skills and cloud strategy; have a look at the tools out there like Stitch, Fivetran, etc. If you don't mind having on-premise technologies, why not stick with SSIS, or use something like Apache Airflow (requires up-skilling)?
Snowflake will not help you with the E of ELT; you will need a third-party tool, like SSIS, to manage the extraction of data from your other systems. It will help with the L part: for this you can use Snowpipe or COPY commands, which are available within the Snowflake ecosystem. Snowflake will also help you share your data with external parties, which is really nice.
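As a rough sketch of that L part over JDBC (assuming the Snowflake JDBC driver; the account URL, database, warehouse, stage and table names below are placeholders, not anything from this thread):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Properties;

    public class SnowflakeLoad {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("user", System.getenv("SNOWFLAKE_USER"));
            props.put("password", System.getenv("SNOWFLAKE_PASSWORD"));
            props.put("db", "ANALYTICS");      // placeholder database
            props.put("schema", "RAW");        // placeholder schema
            props.put("warehouse", "LOAD_WH"); // placeholder warehouse

            // The account identifier in the URL is a placeholder too
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:snowflake://myaccount.snowflakecomputing.com/", props);
                 Statement stmt = conn.createStatement()) {
                // Bulk-load files from a named stage into a raw table
                stmt.execute("COPY INTO raw_orders FROM @raw_stage/orders/ "
                    + "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)");
            }
        }
    }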
My organization has created a fairly complicated dimensional model in Snowflake using layers of SQL views, against which we can point our reporting tools. We use a separate replication tool for extraction from source systems and loading into Snowflake. Using views simplifies our approach in that we don't need to use an additional tool. It also makes managing the code easier than something like SSIS. For instance, we can search for code using the Snowflake interface or our version control tool instead of having to open individual SSIS packages.
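To sketch what that view-based 'T' can look like - including one way to derive a timeseries from a non-timeseries source, per the question above - here is a hedged example, again over JDBC; the raw_orders table and its columns are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class SnowflakeTransformView {
        public static void main(String[] args) throws Exception {
            // Connection details as in the loading sketch above
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:snowflake://myaccount.snowflakecomputing.com/",
                     System.getenv("SNOWFLAKE_USER"),
                     System.getenv("SNOWFLAKE_PASSWORD"));
                 Statement stmt = conn.createStatement()) {
                // Derive a daily timeseries from a non-timeseries source by
                // aggregating on a load timestamp recorded at ingest time
                stmt.execute("CREATE OR REPLACE VIEW daily_orders AS "
                    + "SELECT DATE_TRUNC('DAY', loaded_at) AS day, "
                    + "       COUNT(*) AS orders, SUM(amount) AS revenue "
                    + "FROM raw_orders GROUP BY 1");
            }
        }
    }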

A Priori Graphic Tool For Building Microservices

I've spent a lot of time searching for a tool to draw a-priori diagrams of a microservices architecture that can be used to create the first layer of an OpenAPI specification of the infrastructure, which could later be extended directly through JSON or YAML with the original graphic output updated accordingly.
So far I have been unable to find any tool that can help me do this.
I'm wondering if someone out there has some tools in mind for this; otherwise I will, as always, use draw.io.
Thanks in advance.
You can try one of these:
https://www.visual-paradigm.com/solution/freeumldesigntool/
https://sparxsystems.com/products/ea/compare-editions.html
https://www.websequencediagrams.com
https://www.gliffy.com/
I am not really sure if you can export the diagrams to JSON and YAML; in some of them you can export to HTML and XML.
Gliffy is good if you want nice diagrams for AWS services and the like.

Which CEP product to start with?

I want to learn more about how to build CEP-based applications. So I looked around and found several products (overview here: http://rulecore.com/CEPblog/?page_id=47).
But as there are quite a few at the moment, I don't know which is the best to start with. Overall, I would only consider the ones available for free; the rest are a bit too expensive for just private use ;)
Esper is free, but without Esper Studio it seems quite tedious to develop a CEP app. StreamBase offers a free trial, but I couldn't find out how long you can use it (if only for a month, that's not that helpful for longer research). The Oracle CEP suite seems quite complete, but in the CEP scene - as far as I can see - it is the least recognized compared to Esper or StreamBase.
So do you have any hints on the best way to start with CEP development? Is it worth spending time working through the Oracle documentation, or is it better to start with Esper or StreamBase?
Cheers,
Andreas
Microsoft's CEP offering is StreamInsight, which closely resembles the reactive programming model of the Rx Framework and LINQ.
A Hitchhiker's Guide to StreamInsight Queries is a good place to start.
Some Code Examples
I would recommend using LINQPad, which can connect to StreamInsight, as a canvas for your queries.
The current CEP tools do not solve identical problems! So depending on what you want to do, you'd want to use different tools. In short, my personal choices would be:
For building data-driven algorithms, coding in a type of SQL with extensions - the Coral8 engine from Aleri. Free for test and development (it was, anyway, before being bought by Aleri).
For detecting event patterns (situations), with no coding (declarative style) but configuration using XML - ruleCore, with a free test subscription to the (web) service.
For a mix of both, with low-level control and coding in Java - Esper, GPL (see the sketch after this list).
For creating data-driven computation logic using a graphical boxes-and-arrows style of GUI - StreamBase.
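To give a feel for the Esper option, here is a minimal sketch against Esper's classic (pre-8.x) Java API; the event class and the EPL query are my own illustration, not from any of these products' docs:

    import com.espertech.esper.client.EPServiceProvider;
    import com.espertech.esper.client.EPServiceProviderManager;
    import com.espertech.esper.client.EPStatement;
    import com.espertech.esper.client.EventBean;
    import com.espertech.esper.client.UpdateListener;

    public class EsperAverage {
        // Esper can use plain JavaBeans as event types
        public static class StockTick {
            private final String symbol;
            private final double price;
            public StockTick(String symbol, double price) {
                this.symbol = symbol;
                this.price = price;
            }
            public String getSymbol() { return symbol; }
            public double getPrice() { return price; }
        }

        public static void main(String[] args) {
            EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider();
            engine.getEPAdministrator().getConfiguration()
                  .addEventType("StockTick", StockTick.class);

            // Continuous query: average price per symbol over a sliding
            // 30-second time window
            EPStatement stmt = engine.getEPAdministrator().createEPL(
                "select symbol, avg(price) as avgPrice "
                + "from StockTick.win:time(30 sec) group by symbol");
            stmt.addListener(new UpdateListener() {
                @Override
                public void update(EventBean[] newEvents, EventBean[] oldEvents) {
                    if (newEvents != null) {
                        System.out.println(newEvents[0].get("symbol")
                            + " avg: " + newEvents[0].get("avgPrice"));
                    }
                }
            });

            engine.getEPRuntime().sendEvent(new StockTick("ACME", 12.0));
            engine.getEPRuntime().sendEvent(new StockTick("ACME", 14.0));
        }
    }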
I think the best approach is to compare the solutions that are freely available and then build something with them.
I'm not sure what your end goals are - whether it's to learn a technology that you use at work or just to play around with something cool - but for me, on a project like this, the deciding factor would be which tool I can use to make something I could share with the world.
In this case, my options would probably be Esper or OpenESB. That way, I could put the project on a resume (especially if I was applying for a job that used CEP tools) and share it with the world.
You could read the blog of Curt Monash (http://www.dbms2.com); he writes about things like CEP.
Would there be any interest in a free subscription to the ruleCore (cloud, SaaS, or whatever these are called today) service? It would run on smaller and less reliable (no cluster) hardware and would probably only be usable for testing out small, low-performance kinds of things. If support#rulecore.com gets a couple of requests of this kind, I'm sure it will be put onto the todo list...
For detecting event patterns I found that ruleCore is pretty easy to use. I have only tried to detect patterns of low and medium complexity, and that worked fine. It takes some time to get used to the concepts, but it is actually a very small system, so it was not that bad. And you need to like XML, as everything is done using XML.
If you are trying to create a trading application then StreamBase would be better. But for monitoring stuff rulecore feels better.
If you have continuous streams (market feeds, IoT sensors, Twitter, news, etc.), then stream processing technology is the right choice for you. Stream processing / streaming analytics is only one part of what different CEP solutions offer (streams, rules, patterns, etc.).
These days there are several open-source options for stream processing, e.g. Apache Storm, Apache Spark or Apache Samza, but also powerful proprietary products such as IBM InfoSphere Streams, TIBCO StreamBase or Software AG's Apama.
Take a look at my blog post and article for more details about different stream processing and streaming analytics solutions (open source and proprietary):
Comparison of Stream Processing and Streaming Analytics Alternatives (Apache Storm, Spark, IBM InfoSphere Streams, TIBCO StreamBase, Software AG Apama)
I would start with the free trial of Aleri Coral8 (currently Sybase).
