I don't have any experience with any ETL tool. However, I want to know whether it is possible to do the following using an ETL tool, or whether we need to write a Java (or other) batch job to do it:
Scenario 1:
The source system has different REST APIs. I need to get the data, transform it, and then store it in MongoDB.
The hardest part is the transformation. There can be situations where I need to call one REST API of the source and, based on its data, call several other REST APIs using the first API's data. After that we need to reformat the combined data and store it in Mongo.
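Roughly, if I had to hand-code it in plain Java (using the built-in HTTP client and the MongoDB sync driver; the URLs and field names below are just placeholders I made up), Scenario 1 would look something like this:

```java
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class Scenario1Sketch {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // 1. Extract: call the first REST API of the source system (placeholder URL).
        String firstJson = get(http, "https://source.example.com/api/orders");
        Document firstResult = Document.parse(firstJson);

        MongoCollection<Document> target = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("targetDb")
                .getCollection("collection1");

        // 2. For each reference found in the first response, call further APIs (field names are made up).
        for (String customerId : firstResult.getList("customerIds", String.class)) {
            String customerJson = get(http, "https://source.example.com/api/customers/" + customerId);
            Document customer = Document.parse(customerJson);

            // 3. Transform: reshape the combined data into the document format we want to store.
            Document out = new Document("customerId", customerId)
                    .append("name", customer.getString("name"))
                    .append("source", "rest-api");

            // 4. Load the transformed document into MongoDB.
            target.insertOne(out);
        }
    }

    private static String get(HttpClient http, String url) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

My question is whether an ETL tool can express this kind of chained extract-and-transform instead of hand-written code like the above.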
Scenario 2:
The source system has a DB. I need to transform the data using my custom logic and store it in MongoDB.
Here the custom logic can include things like this:
From table1 of the source I create collection1. After that I need to consult table2 and the previously created collection1, process the data, and then create collection2.
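Again as a rough sketch in plain Java (JDBC plus the MongoDB driver; the connection strings, table and field names are only placeholders), the kind of custom logic I mean is roughly:

```java
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class Scenario2Sketch {
    public static void main(String[] args) throws Exception {
        MongoDatabase mongo = MongoClients.create("mongodb://localhost:27017").getDatabase("targetDb");
        MongoCollection<Document> collection1 = mongo.getCollection("collection1");
        MongoCollection<Document> collection2 = mongo.getCollection("collection2");

        // Read table2 from the source database (connection details and column names are placeholders).
        try (Connection src = DriverManager.getConnection("jdbc:mysql://source-host/sourcedb", "user", "pass");
             Statement stmt = src.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, ref_id, amount FROM table2")) {

            while (rs.next()) {
                // Consult the previously built collection1 for the matching document.
                Document fromCollection1 = collection1
                        .find(Filters.eq("refId", rs.getString("ref_id")))
                        .first();

                // Custom logic: combine the table2 row with the collection1 document, then build collection2.
                Document out = new Document("table2Id", rs.getLong("id"))
                        .append("amount", rs.getDouble("amount"))
                        .append("enrichedFrom", fromCollection1 == null ? null : fromCollection1.get("someField"));
                collection2.insertOne(out);
            }
        }
    }
}
```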
Is this possible using any ETL tool? If so, which tool? If possible, please describe as briefly as possible how it can be done and the relevant terminology, so that I can search the internet, learn, and implement it.
Briefly speaking: yes, that is exactly what ETL tools are for. You can Extract data from REST sources, Transform it using sophisticated logic, and Load it into a target like MongoDB.
The exact implementation depends on the tool. While I guess you will get help if you run into problems implementing the solution in any of the tools, I don't think anyone will prepare complete, detailed solutions for you.
Related
I am very new to the Talend ETL tool.
I have a very basic question: can I update the design workflow and transformations in the Talend ETL tool at runtime?
I mean, suppose my application is running on a server. Now I want to change the design workflow of the running application so that it is updated to the new design workflow at runtime. Similarly, I want to change the transformation logic at runtime. I think MuleSoft provides this capability.
I would appreciate your help. Thanks in advance.
As @Jim Macaulay said in a comment, it depends on what you want to change.
Is it the columns that a row contains?
Then you might need Dynamic Schema, which is a paid feature (or use different flows, see the next part).
Is it simply a matter of alternating between 2 distinct data sources (or X different flows) based on external stimuli?
Then you could use the If trigger with context variables to use one or the other.
We're considering Snowflake and want to understand how we could use it, and possibly other tools, to overcome one of our main problems - ETL! We currently use a legacy DWH with an ETL process consisting of SSIS and some views. This has all the common pitfalls of this methodology - most notably that it takes ages!
I was under the assumption that we'd move to an ELT model in Snowflake, so I started to research tools to do the 'T' part of it. However, I've just been listening to this podcast: https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/
And it suggests that just slapping a SQL view over something and exposing it in, say, Power BI or Tableau is enough for the T part of things!
Just wondering what people's experience was here?
- Do you do transformations just by writing a view in Snowflake?
- Do you use a third party tool specifically to address this need?
Secondary to this, for the Extraction and Loading, do you:
- Do this using Snowflake only
- Use a third party tool
I'm specifically interested in whether you do this to create some kind of time series in Snowflake from a non-time-series source. That's something we'd be keen to do.
This question is hard to answer without sounding opinionated, especially without knowing your use case. For what it's worth, here is what I think:
Don't stick views on top of your tables and expose them to a reporting tool unless you have a very, very simple setup. If you're considering a tool like Snowflake then you will probably want to go for something more sustainable; this approach can become prohibitive in terms of cost and complexity in your views.
Use a third-party tool to manage your ELT process. Your choice of tool will depend on your internal skills and cloud strategy; have a look at the tools out there like Stitch, Fivetran, etc. If you don't mind having on-premise technologies, why not stick with SSIS or use something like Apache Airflow (which requires up-skilling)?
Snowflake will not help you with the E of ELT; you will need a third-party tool such as SSIS to manage the extraction of data from your other systems. It will help with the L part: you can use Snowpipe or COPY commands, which are available within the Snowflake ecosystem. Snowflake will also help you share your data with external parties, which is really nice.
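As a minimal sketch of the L part (assuming the Snowflake JDBC driver, with the account, warehouse, stage and table names below made up for illustration), a COPY command can be run from Java like this; in practice you would usually schedule it from your orchestration tool or let Snowpipe load files as they arrive:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class SnowflakeCopySketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "LOAD_USER");
        props.put("password", System.getenv("SNOWFLAKE_PASSWORD")); // assumes this env var is set
        props.put("warehouse", "LOAD_WH");
        props.put("db", "ANALYTICS");
        props.put("schema", "RAW");

        // The account identifier in the URL is a placeholder.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:snowflake://myaccount.snowflakecomputing.com/", props);
             Statement stmt = conn.createStatement()) {

            // Load staged CSV files into a raw table; transformations then happen
            // inside Snowflake (the T of ELT), e.g. via views or scheduled SQL.
            stmt.execute(
                "COPY INTO RAW.ORDERS " +
                "FROM @RAW.ORDERS_STAGE " +
                "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)");
        }
    }
}
```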
My organization has created a fairly complicated dimensional model in Snowflake using layers of SQL views, at which we can point our reporting tools. We use a separate replication tool for extraction from source systems and loading into Snowflake. Using views simplifies our approach in that we don't need an additional tool. It also makes managing the code easier than something like SSIS: for instance, we can search for code using the Snowflake interface or our version control tool instead of having to open individual SSIS packages.
I've spent a lot of time reading and watching videos of people talking about how they use tools designed for handling huge datasets and real-time processing in their architectures. And while I understand what it is that tools like Hadoop/Cassandra/Kafka etc do, no one seems to explain how the data gets from these large processing tools to rendering something on a client/webpage.
From what I understand of big data tools, you can't build your application the same way you would a standard web app querying MySQL, which I can understand given the size of the data that flows through these tools. However, for all this talk of "real-time data analytics", I cannot find any explanation of how the actual analytics get put in front of someone as a chart/table/etc.
> explain how the data gets from these large processing tools to rendering something on a client/webpage.
With respect to this, one way would be to process the big data using Spark or Hadoop and store the results in an RDBMS. Then have your web app pull data from the RDBMS to render charts, tables, etc. I can provide examples that I have done myself if you need more information.
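As a hedged illustration of that pattern (the storage path, table and column names are placeholders), a Spark batch job written in Java could reduce the raw events to a small summary table in MySQL:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import java.util.Properties;

public class DailyCountsJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("daily-counts").getOrCreate();

        // Read the raw events from distributed storage (placeholder path).
        Dataset<Row> events = spark.read().parquet("hdfs:///data/events");

        // Reduce the big dataset down to something chart-sized.
        Dataset<Row> dailyCounts = events.groupBy("event_date").count();

        // Write the small summary table to an ordinary RDBMS.
        Properties jdbcProps = new Properties();
        jdbcProps.put("user", "reporting");
        jdbcProps.put("password", System.getenv("DB_PASSWORD")); // assumes this env var is set
        dailyCounts.write()
                .mode(SaveMode.Overwrite)
                .jdbc("jdbc:mysql://db-host:3306/analytics", "daily_event_counts", jdbcProps);

        spark.stop();
    }
}
```

The web app itself then just runs something like SELECT event_date, count FROM daily_event_counts, exactly as it would against any other MySQL table.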
Impala supports ODBC/JDBC interfaces. So, you actually could hook up a web app to it the same way you do with MySQL.
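For example, here is a minimal sketch of querying Impala over JDBC from Java (the host, port, table and choice of driver are assumptions; the URL below goes through the Hive JDBC driver against Impala's HiveServer2-compatible port, while Cloudera's Impala driver would use a jdbc:impala:// URL instead):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQueryExample {
    public static void main(String[] args) throws Exception {
        // Unsecured cluster assumed; adjust the URL and auth settings for your environment.
        String url = "jdbc:hive2://impala-host:21050/default;auth=noSasl";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT event_date, COUNT(*) AS events FROM web_events GROUP BY event_date")) {

            // In a real web app this would feed a chart or table instead of stdout.
            while (rs.next()) {
                System.out.println(rs.getString("event_date") + " -> " + rs.getLong("events"));
            }
        }
    }
}
```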
Other stuff you might want to check out includes HBase, Kudu, and Solr. In some real-time architectures the data ends up in one of those. And all of them have some sort of API that you can use in your web app to access their data.
If you want a simple solution for realtime data processing and analytics, check out the new Stride API, which enables developers to collect, process, and analyze streaming data and then either visualize summary data in Stride or push processed data out to applications in realtime. This is a very easy way to build the kind of realtime reporting dashboards and monitoring / alerting systems you described above.
Take a look at the Stride API technical docs for examples and more info on how to implement this.
I need to create a website that reads the contents of different websites and helps compare them.
One example of a similar website:
http://www.mysmartprice.com/mobile/samsung-galaxy-grand-2-msp3633
This helps us compare prices of Samsung mobiles across different online websites.
Now I need to know:
1. How to read data from different websites.
Using Java, I can read and fetch HTML data. But the question arises: what is the best way to parse the HTML content to get the desired information?
I want to use Spring XD. Please suggest the best strategy.
Regards,
Jubin
I think you need to develop a Java application for each data source, then develop a custom "source" module, and use Spring XD to ingest the data.
Another solution is to develop the applications so that they load the required data into CSV files and transfer them into a path like /tmp/xd/input automatically when they run, and then use Spring XD to ingest the data from the CSV files into whatever destination you need.
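As a rough sketch of one such per-site application (the URL, CSS selectors and output file name are placeholders, and jsoup is just one common choice for the HTML-parsing part of your question), each run scrapes a retailer page and drops a CSV where the Spring XD file source is watching:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class RetailerScraper {
    public static void main(String[] args) throws Exception {
        // Fetch and parse one retailer's product page (placeholder URL and selectors).
        Document doc = Jsoup.connect("https://retailer.example.com/samsung-galaxy-grand-2").get();

        List<String> lines = new ArrayList<>();
        lines.add("product,price,retailer");
        for (Element item : doc.select(".product-listing")) {
            String name = item.select(".product-name").text();
            String price = item.select(".product-price").text();
            lines.add(name + "," + price + ",retailer.example.com");
        }

        // Drop the CSV where the Spring XD file source is configured to pick it up.
        Path out = Paths.get("/tmp/xd/input/retailer-prices.csv");
        Files.createDirectories(out.getParent());
        Files.write(out, lines);
    }
}
```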
I'm writing a report and thought you could help by providing me with the costs of company support for setting up and training a client on a data integrator for Salesforce. E.g., if someone wants to use Salesforce but first needs a tool to consolidate and transfer data from back-office systems to Salesforce, how much would that support service cost?
Salesforce actually comes with a very good integration tool called Data Loader. It can be run as an interactive application under Windows or Macintosh, or it can be run as a command-line tool on Windows, Mac or Linux.
In interactive mode, it can import & export CSV files.
In batch mode it can also read data from, and write data to, a database.
For example, I have a Linux server where a daily cron job activates the Data Loader which runs several jobs. Some of these jobs run SQL against a database and upload the resulting data into Salesforce. Other jobs extract from Salesforce (using their SOQL query language, which is SQL-like) and store the information into a database.
Data Loader has a bit of a learning curve for batch mode (mostly around creating some XML configuration files), but the Interactive mode is very easy to use.
So, to answer your question... If it's a one-time data load, just run the interactive version and it's easy. If you want regularly-updated data, then use the batch mode. Support costs for operating the integration are really all in the setup. Once it's running, there shouldn't be any on-going costs unless the data structures change and you want to change the data being transferred. Better yet, if the system is setup by somebody who has done it before, you'll avoid a big learning curve.
If you want a figure to put into your report, then allow 3 days for the initial integration (allows for learning curve) and then a half-day for each additional one. That's generous, but provides extra time to debug problems.
To some degree, it depends on two factors:
Where is the data's source of truth?
How often do you want to sync the data?
If the answers are "it's a weird place and I only need to sync it once," then you probably want to figure out how to get it in CSV form and then use tools built into Salesforce to import it.
However, if the data lives in a database or data warehouse (postgres, mysql, mongo, redshift, snowflake, big query, etc) and especially if you want to keep Salesforce up to date with that source of truth continuously, then you could look into so-called "Reverse ETL" tools made for this purpose.
Costs depend on the tool chosen and the data volumes and other factors, but here are some options:
Grouparoo is an open source Reverse ETL tool. You can host it yourself for free. Paid plans start at $150/month.
Census is a SaaS Reverse ETL tool. Paid plans start at $300/month.
Hightouch is a SaaS Reverse ETL tool. Paid plans start at $350/month.