We're considering Snowflake and want to understand how we could use it, and possibly other tools, to overcome one of our main problems - ETL! We currently use a legacy DWH with an ETL process consisting of SSIS and some views. This has all the common pitfalls of this methodology - most notably that it takes ages!
I was under the assumption that we'd move to an ELT model in Snowflake, so I started to research tools to do the 'T' part of it. However, I've just been listening to this podcast: https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/
And it suggests that just slapping a SQL view over something and exposing it in, say, Power BI or Tableau is enough for the 'T' part of things!
Just wondering what people's experience was here?
- Do you do transformations just by writing a view in Snowflake?
- Do you use a third party tool specifically to address this need?
Secondary to this, for the Extraction and Loading, do you:
- Do this using Snowflake only
- Use a third party tool
I'm specifically interested in whether you do this to create some kind of time series in Snowflake from a non-timeseries source. That's something we'd be keen to do.
This question is hard to answer without sounding opinionated, especially without knowing your use case. For what it's worth, here is what I think:
Don't stick views on top of your tables and expose them to a reporting tool unless you have a very, very simple setup. If you're considering a tool like Snowflake then you will probably want something more sustainable; the views-only approach can become prohibitive in terms of cost and the complexity of your views.
Use a third-party tool to manage your ELT process. Your choice of tool will depend on your internal skills and cloud strategy; have a look at the tools out there like Stitch, Fivetran, etc. If you don't mind having on-premise technologies, why not stick with SSIS, or use something like Apache Airflow (requires up-skilling)?
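If Airflow makes your shortlist, here is a minimal sketch of what an orchestrated ELT job could look like (assuming Airflow 2.4+; the DAG id, schedule and task bodies are placeholders, not a working pipeline):

```python
# Hypothetical nightly ELT DAG; task bodies are stubs for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_source():
    # Pull data out of the source system, e.g. dump tables to cloud storage.
    pass


def load_into_snowflake():
    # Kick off a COPY INTO (or rely on Snowpipe) to load the staged files.
    pass


with DAG(
    dag_id="nightly_elt",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_source)
    load = PythonOperator(task_id="load", python_callable=load_into_snowflake)
    # The "T" then happens in SQL inside Snowflake, downstream of the load.
    extract >> load
```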
Snowflake will not help you with the E of ELT; you will need a third-party tool (SSIS, for example) to manage the extraction of data from your other systems. It will help with the L part, for which you can use Snowpipe or COPY commands, both available within the Snowflake ecosystem. Snowflake will also help you share your data with external parties, which is really nice.
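To make the L part concrete, here is a minimal sketch using the snowflake-connector-python package; the credentials, stage (my_stage) and target table are placeholders, and Snowpipe essentially automates the same COPY whenever new files arrive on the stage:

```python
# Hedged sketch of a manual COPY-based load; all names/credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="LOAD_WH",
    database="RAW",
    schema="PUBLIC",
)
try:
    # Load staged CSV files into a raw table; transformations come later, in SQL.
    conn.cursor().execute("""
        COPY INTO orders
        FROM @my_stage/orders/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
finally:
    conn.close()
```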
My organization has created a fairly complicated dimensional model in Snowflake using layers of SQL views, against which we can point our reporting tools. We use a separate replication tool for extraction from source systems and loading into Snowflake. Using views simplifies our approach in that we don't need to use an additional tool. It also makes managing the code easier than something like SSIS. For instance, we can search for code using the Snowflake interface or our version control tool instead of having to open individual SSIS packages.
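As a rough illustration of that layering (all table and column names here are invented), the lower view cleans the raw load and the upper view derives the kind of daily time series from a non-timeseries source that the original question mentions; run the statements with whichever Snowflake client you prefer:

```python
# Hedged sketch of two view layers; raw.orders, order_ts and amount are
# illustrative names, not a real schema.
STAGING_VIEW = """
CREATE OR REPLACE VIEW staging.orders_clean AS
SELECT order_id,
       CAST(order_ts AS TIMESTAMP_NTZ) AS order_ts,
       amount
FROM raw.orders
WHERE order_id IS NOT NULL
"""

# A generated date spine turns event-level rows into a gap-free daily series,
# zero-filling days with no orders.
DAILY_SERIES_VIEW = """
CREATE OR REPLACE VIEW marts.daily_order_totals AS
WITH date_spine AS (
    SELECT DATEADD(day, ROW_NUMBER() OVER (ORDER BY NULL) - 1,
                   '2023-01-01'::DATE) AS day
    FROM TABLE(GENERATOR(ROWCOUNT => 730))
)
SELECT d.day,
       COALESCE(SUM(o.amount), 0) AS total_amount
FROM date_spine d
LEFT JOIN staging.orders_clean o
       ON TO_DATE(o.order_ts) = d.day
GROUP BY d.day
"""
```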
I don't have any experience with any ETL tool. However, I want to know whether it is possible to do the following using an ETL tool, or whether we need to write a Java (or other) batch job to do it:
Scenario 1:
The source system exposes various REST APIs. I need to get the data, transform it, then store it in MongoDB.
The hardest part is the transformation. There can be situations where I need to call one REST API on the source and, based on its response, call several other REST APIs using data from that first call. After that we need to reshape the combined data into a different format and store it in Mongo.
Scenario 2:
The source system has a DB. I need to transform the data using my custom logic and store it in MongoDB.
Here the custom logic can include things like this:
From table1 of the source I create collection1. After that I need to consult table2 and the previously created collection1, process the data and then create collection2.
Is this possible using an ETL tool? If so, which tool? Please describe, as briefly as possible, how it can be done and the relevant terminology, so that I can search the internet, learn it and implement it.
Briefly speaking: yes, that is exactly what ETL tools are for. You can Extract data from REST sources, Transform it using sophisticated logic and Load it into a target like MongoDB.
Exact implementation depends on the tool. While I guess you will get help if you run across problems implementing the solution in any of the tools, I don't think anyone will prepare complete, detailed solutions for you.
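For orientation, if you did decide to hand-roll Scenario 1 as a plain Python batch job instead of using an ETL tool, it might look roughly like this; the endpoints, field names and Mongo connection string are all invented for illustration:

```python
# Hypothetical Scenario 1 job: pull from one REST API, make dependent calls,
# reshape, and store the documents in MongoDB.
import requests
from pymongo import MongoClient

BASE_URL = "https://source.example.com/api"  # placeholder source system


def run_batch():
    client = MongoClient("mongodb://localhost:27017")  # placeholder Mongo
    collection = client["target_db"]["collection1"]

    # Extract: the first API call drives the rest.
    customers = requests.get(f"{BASE_URL}/customers", timeout=30).json()

    documents = []
    for customer in customers:
        # Dependent extract: use data from the first response to call further APIs.
        orders = requests.get(
            f"{BASE_URL}/customers/{customer['id']}/orders", timeout=30
        ).json()
        # Transform: reshape everything into the document format Mongo should hold.
        documents.append({
            "customer_id": customer["id"],
            "name": customer["name"],
            "order_count": len(orders),
            "orders": orders,
        })

    # Load.
    if documents:
        collection.insert_many(documents)


if __name__ == "__main__":
    run_batch()
```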
Are there any EDW (enterprise data warehouse) systems designed using a NoSQL/Hadoop solution?
I know there are PDW systems (MS PDW PolyBase, Greenplum HAWQ, etc.) which connect to HDFS sub-systems. These are proprietary hardware and software solutions and are expensive at scale. I am looking for a NoSQL or Hadoop based, preferably open-source, solution for an enterprise data warehouse. I would like to hear about your experiences if you have implemented one. Just to mention again, I am not looking for any kind of proprietary RDBMS as a player in this EDW solution.
I did some research on the internet; it seems possible (Impala is one option), but I did not find anyone who had really implemented it completely with NoSQL or Hadoop.
If you have done something of this type, I would like to hear how you designed it, which tools your business analysts use, and so on. If you can share your experience along the journey, that would be really appreciated.
Update:
How about VoltDB and NEOdb? They are not true RDBMSs, but they claim to support ANSI SQL to a large extent.
The first problem you will face when building an EDW on top of Hadoop is that its storage is not updatable, so you should forget about SQL UPDATE and DELETE commands.
Second, a solution built on top of Hadoop is usually several times more expensive to maintain: more expensive specialists, more complex debugging (compare debugging a problem in a Hive query vs. a SQL query problem in Oracle - which would be easier?).
Third, Hadoop usually gives you much less concurrency and much higher latency for any type of workload you put on top of it.
Given all of this, why do you think a DWH gets built on top of Hadoop only at really big enterprises like Facebook, Yahoo, eBay, LinkedIn and so on? Because it is not that simple to do, although once implemented it can be more scalable and more customizable than any proprietary solution.
So if you have clearly decided to go with Hadoop or another NoSQL solution to build your DWH, I would recommend the following:
- Use Hadoop HDFS as the base for data storage
- Use Flume for loading data into HDFS
- Use Hive with Tez for heavy ETL jobs
- Provide Impala as a SQL query interface for analysts
- Provide Spark as an advanced instrument for analysts (see the sketch below)
- Use Ambari for management and provisioning of all of these tools together
Together, these tools will cover most of your needs.
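For the analyst-facing Spark piece, a hedged sketch might look like the following, assuming the ETL layer has already landed a Parquet table in HDFS (the path and column names are invented):

```python
# Hypothetical analyst session on Spark; fact_events and event_date are
# illustrative names for data already materialised in HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("edw_analyst_sandbox").getOrCreate()

# Read a table the ETL layer wrote to HDFS as Parquet.
events = spark.read.parquet("hdfs:///warehouse/edw/fact_events")
events.createOrReplaceTempView("fact_events")

# Analysts can stay in SQL even though it executes on Spark.
daily = spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM fact_events
    GROUP BY event_date
    ORDER BY event_date
""")
daily.show(20)
```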
What is the best end-user language for data transformation (ETL), ideally something that can run on a JVM (with an LGPL-like license)?
By 'end-user' I mean users who are IT-literate but might not be experts in OO or the like.
You can try CloverETL. It is a Java-based ETL tool. You can either define transformations through its visual designer or hand-code them directly in the XML files that store the transformation definitions.
It also has its own transformation language, CTL, which is used inside transformation components. You do not need to be an expert in OOP to use it.
Rust is the best language for this. In fact, companies can now evolve from ETL to STL (Stream, Transform and Load) with a technology like InfinyOn Cloud, a fully managed Fluvio service. This solution brief explains how the financial services industry is leveraging this tech:
https://www.infinyon.com/resources/real-time-data-trasformation/
I'm writing a report and thought you guys could help by providing me with the costs of company support in setting up and training a client on a data integrator for Salesforce. E.g., if someone wants to use Salesforce but first needs a tool to consolidate and transfer data from back-office systems into Salesforce, how much would that support service cost?
Salesforce actually comes with a very good integration tool called Data Loader. It can be run as an interactive application under Windows or Macintosh, or it can be run as a command-line tool on Windows, Mac or Linux.
In interactive mode, it can import & export CSV files.
In batch mode it can also read data from, and write data to, a database.
For example, I have a Linux server where a daily cron job activates the Data Loader which runs several jobs. Some of these jobs run SQL against a database and upload the resulting data into Salesforce. Other jobs extract from Salesforce (using their SOQL query language, which is SQL-like) and store the information into a database.
Data Loader has a bit of a learning curve for batch mode (mostly around creating some XML configuration files), but the Interactive mode is very easy to use.
So, to answer your question: if it's a one-time data load, just run the interactive version and it's easy. If you want regularly updated data, then use the batch mode. Support costs for operating the integration are really all in the setup. Once it's running, there shouldn't be any ongoing costs unless the data structures change and you want to change the data being transferred. Better yet, if the system is set up by somebody who has done it before, you'll avoid a big learning curve.
If you want a figure to put into your report, then allow 3 days for the initial integration (allows for learning curve) and then a half-day for each additional one. That's generous, but provides extra time to debug problems.
To some degree, it depends on two factors:
Where is the data's source of truth?
How often do you want to sync the data?
If the answers are "it's a weird place and I only need to sync it once," then you probably want to figure out how to get it in CSV form and then use tools built into Salesforce to import it.
However, if the data lives in a database or data warehouse (Postgres, MySQL, Mongo, Redshift, Snowflake, BigQuery, etc.), and especially if you want to keep Salesforce continuously up to date with that source of truth, then you could look into so-called "Reverse ETL" tools made for this purpose.
Costs depend on the tool chosen and the data volumes and other factors, but here are some options:
- Grouparoo is an open source Reverse ETL tool. You can host it yourself for free. Paid plans start at $150/month.
- Census is a SaaS Reverse ETL tool. Paid plans start at $300/month.
- Hightouch is a SaaS Reverse ETL tool. Paid plans start at $350/month.
We have a client that has Oracle Standard, and a project that would be ten times easier to address using OLAP. However, Oracle only supports OLAP in the Enterprise version.
Migration to Enterprise is not possible.
I'm thinking of doing some manual simulation of OLAP, creating relational tables to simulate the technology.
Do you know of some other way I could do this? Maybe an open-source tool for OLAP? Any ideas?
You can simulate OLAP functionality using client side tools pointed at a relational database.
Personally I think the best tool for the job is probably Tableau Desktop. This is an amazingly sophisticated front end analytics tool that will make your relational data look multidimensional without much effort, and the tool itself is really mind blowing. They have a free trial so you can take it for a spin. We use Tableau heavily for our own analysis and have been very impressed. Of course, this tool also works with multidimensional databases as well, so if you end up with some cubes at the end of the day you can continue to use the Tableau front end.
As for open source, you could try out Palo - an open source MOLAP server and Excel front end.
If you are interested in building your own reporting front end and use .NET, there are a number of components (such as the DevExpress PivotGrid or the several tools from RadarSoft) that will do the same thing, but they will require some elbow grease to get wired together.
I find that it's the schema that causes most of the issues people have with querying a database. OLAP forces you into either a flat table or a star/snowflake schema, which is easy to query and comparatively fast relative to the source OLTP tables. So if you ETL your source into a flat table or star schema you should get 80% of what you get from OLAP, the remaining 20% being MDX, analytic functions and performance.
Note that you should get a performance boost from a star schema in a relational database as well, and Oracle has analytic functions in its SQL anyway.
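To make that concrete, here is a hedged sketch of the kind of star-schema query a reporting tool would run, with an analytic (window) function standing in for what a cube would pre-aggregate; the tables (fact_sales, dim_date, dim_product), connection details and the cx_Oracle driver choice are all illustrative assumptions:

```python
# Hypothetical star-schema rollup with a running total via an analytic function.
import cx_Oracle

conn = cx_Oracle.connect(user="reporting", password="secret", dsn="dbhost/ORCLPDB1")
cur = conn.cursor()

cur.execute("""
    SELECT d.calendar_month,
           p.product_category,
           SUM(f.sales_amount) AS month_sales,
           SUM(SUM(f.sales_amount)) OVER (
               PARTITION BY p.product_category
               ORDER BY d.calendar_month
           ) AS running_sales
    FROM fact_sales f
    JOIN dim_date d ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.calendar_month, p.product_category
""")
for row in cur:
    print(row)

conn.close()
```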
Try an open-source OLAP server called Mondrian. IIRC its XMLA API is sufficiently compatible with Analysis Services to fool Pivot Table Services, which would allow you to use it with ProClarity or Excel.
IIRC it was originally designed to work over Oracle - it is a HOLAP architecture using base tables in the underlying relational store and caching aggregates. You can also make use of materialised views and query rewrite in the underlying Oracle database to do aggregates.
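As a rough illustration of that aggregate trick (the object names are invented), a summary materialized view with query rewrite enabled lets Oracle transparently answer matching rollup queries from the pre-computed aggregate:

```python
# Hedged sketch of a summary materialized view for query rewrite; run the DDL
# with your Oracle client of choice. All object names are illustrative.
SUMMARY_MV = """
CREATE MATERIALIZED VIEW mv_sales_by_month
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT d.calendar_month,
       p.product_category,
       SUM(f.sales_amount) AS total_sales,
       COUNT(*) AS row_count
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.calendar_month, p.product_category
"""
```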
A few more thoughts on this topic:
Actually, Oracle Standard does have an OLAP facility, based on a descendant of Express, embedded in the database engine and storing its internal data structures in BLOBs in the main tablespaces. Using this is technically possible but not necessarily advisable, for the following reasons:
It uses a highly non-standard OLAP query engine with very little third-party tool support (AFAIK ArcPlan is the only third-party OLAP front-end supporting 10g+ OLAP), poor documentation for the query language and almost no third-party literature describing it. It will work with B.I. Beans if you feel like writing a JSP front-end. It is not compatible with MDX at all. As of early 2006, the best Oracle could do when asked about drillthrough (this functionality was not supported in Discoverer 'Drake') was to recommend building a JSP application using B.I. Beans.
The reason there is no migration path from Standard to Enterprise is that Enterprise is actually what used to be Siebel Analytics, while Standard is the old Oracle OLAP/Express descendant which Oracle partners recommended avoiding even before Oracle bought out Siebel. Oracle has not even attempted to support migrating.
From this point of view, Mondrian is actually the most cost-effective OLAP solution for an Oracle Standard Edition shop; you can get a supported version from an outfit called Pentaho. The next cheapest is Analysis Services, which comes with SQL Server. After that you are into the likes of Hyperion Essbase, which will be an order of magnitude more expensive than SQL Server or any supported version of Mondrian.
Whilst MS SQL Server offers OLAP, you'll need an Enterprise licence to use a cube in a live environment that is web-facing.
You might also want to give www.icCube.com a try - we're quite flexible about the data source used to populate the cube and quite cost-effective compared to the big players in the market.