Loading data automatically from Oracle DB to Google BigQuery

Good day,
I have an Oracle DB and I need to load some tables so I can query them in BigQuery.
Is there a way to load the data automatically, every 24 hours, into Google BigQuery?
Any approach would work: loading into Google Cloud Storage and creating the tables from there, or loading into Google Drive from the server.
I really need some ideas; I have read a lot of articles with no luck.

Check this tutorial by Progress:
https://www.progress.com/tutorials/cloud-and-hybrid/etl-on-premises-oracle-data-to-google-bigquery-using-google-cloud-dataflow
In this tutorial the main goal will be to connect to an On-Premises Oracle database, read the data, apply a simple transformation and write it to BigQuery. The code for this project has been uploaded to GitHub for your reference.
This solution uses Dataflow and Progress' Hybrid Data Pipeline tool:
Google Cloud Dataflow is a data processing service for both batch and real-time data streams. Dataflow allows you to build pipes to ingest data, then transform and process according to your needs before making that data available to analysis tools. DataDirect Hybrid Data Pipeline can be used to ingest both on-premises and cloud data with Google Cloud Dataflow.
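If a full Dataflow pipeline is more than you need, the Cloud Storage route mentioned in the question can be sketched in plain Python. This is only a minimal sketch under assumptions: the function names are made up for illustration, any DB-API connection is accepted (sqlite3 works for a local dry run; for Oracle you would pass a connection from an Oracle driver), and the export file still has to be copied to a bucket before the `bq load` step.

```python
import json
from datetime import date, datetime
from decimal import Decimal


def _to_json_value(value):
    """Convert database values that json cannot serialize natively."""
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    if isinstance(value, Decimal):
        return str(value)
    return value


def export_table_ndjson(conn, table, out_path):
    """Write every row of `table` as one JSON object per line; return the row count."""
    cur = conn.cursor()
    # `table` is assumed to come from trusted config, never from user input.
    cur.execute(f"SELECT * FROM {table}")
    columns = [col[0].lower() for col in cur.description]
    count = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        for row in cur:
            record = {c: _to_json_value(v) for c, v in zip(columns, row)}
            fh.write(json.dumps(record) + "\n")
            count += 1
    return count


def bq_load_command(dataset, table, gcs_uri):
    """Argument list for a `bq load` call that replaces the table contents."""
    return [
        "bq", "load", "--replace", "--autodetect",
        "--source_format=NEWLINE_DELIMITED_JSON",
        f"{dataset}.{table}", gcs_uri,
    ]
```

A daily cron job (or any scheduler) could call `export_table_ndjson`, copy the file to a bucket (for example with `gsutil cp`), and then run the command returned by `bq_load_command` via `subprocess.run`. The bucket, dataset and table names are whatever you choose; none of them come from the tutorial above.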

ETL tool Snowflake

We are going to move from SQL Server to Snowflake as the target database for our warehouse.
Today most of our ETL development is done in ODI (Oracle Data Integrator).
So I'm interested to know whether anyone is using ODI together with Snowflake, and how well it works.
I'd also like to hear your experience with and recommendations for other ETL tools with Snowflake as the target, for example:
Matillion
dbt
Xplenty
We have already started using NiFi to move data from the source to Azure Blob Storage.
But we are not sure whether ODI is the right tool for the rest once we are in the cloud.
I'm really looking forward to seeing all your answers.
Snowflake supports transformations both during loading (ETL) and after loading (ELT).
Snowflake works with a wide range of data integration tools, including Informatica, Talend, Tableau, Matillion and others.
In data engineering, new tools and self-service pipelines are eliminating traditional tasks such as manual ETL coding and data cleaning. With easy ETL or ELT options via Snowflake, data engineers can instead spend more time working on critical data strategy and pipeline optimization projects.
With Snowflake as your data lake and data warehouse, ETL can be effectively eliminated, as no pre-transformations or pre-schemas are needed.
In addition, Snowflake Snowpark is designed to make building complex data pipelines easier and to allow developers to interact with Snowflake directly without moving data.
https://www.snowflake.com/trending/etl-tools
If you have started transferring data from the source to Azure Blob Storage, I assume you have an Azure subscription, and your Snowflake account may itself be hosted in an Azure region.
In this case, I recommend using Azure Data Factory directly, so you have everything on one provider and support for data migration from SQL Server.
Link to documentation: Copy and transform data in Snowflake using Azure Data Factory

How to save data from Spark to Google Cloud Platform?

I will extract the data from an Oracle database with Spark, and then I want to store it in some storage service on Google Cloud Platform. Is that possible? The data size is around 10 TB.
You can run Spark in GCP using Qubole. There are also "Data Connectors" available which will allow you to integrate with Oracle and other RDBMS systems.
A general flow could look like:
- Run a Spark job using JDBC to read from Oracle
- Perform any necessary processing
- Write the data back out to GCS or BigQuery
Ref: https://www.qubole.com/blog/technical-overview-of-qubole-on-gcp/
and https://docs-gcp.qubole.com/
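The flow above can be sketched with Spark's built-in JDBC data source. This is a rough sketch, not a tested job: the helper names, host, service name and bounds are all hypothetical, the Oracle JDBC driver jar must be on the Spark classpath, and writing to a `gs://` path assumes the Cloud Storage connector is installed. For a table in the 10 TB range you would almost certainly want the partitioned read, so that the table is pulled over several parallel connections.

```python
def oracle_jdbc_options(host, port, service, table, user, password,
                        partition_column=None, lower_bound=None,
                        upper_bound=None, num_partitions=None):
    """Build the option map for Spark's JDBC data source (Oracle thin driver)."""
    options = {
        "url": f"jdbc:oracle:thin:@//{host}:{port}/{service}",
        "driver": "oracle.jdbc.OracleDriver",
        "dbtable": table,
        "user": user,
        "password": password,
        "fetchsize": "10000",  # rows per round trip; tune for your network
    }
    if partition_column is not None:
        # Spark expects these partitioning options to be supplied together.
        options.update({
            "partitionColumn": partition_column,
            "lowerBound": str(lower_bound),
            "upperBound": str(upper_bound),
            "numPartitions": str(num_partitions),
        })
    return options


def copy_table(spark, options, output_path):
    """Read one Oracle table over JDBC and write it out as Parquet."""
    df = spark.read.format("jdbc").options(**options).load()
    # Any processing step would go here, before the write.
    df.write.mode("overwrite").parquet(output_path)
```

With an existing `SparkSession` called `spark`, usage would look like `copy_table(spark, oracle_jdbc_options("db.example.com", 1521, "ORCL", "SALES.ORDERS", user, password, "ORDER_ID", 1, 1000000, 64), "gs://my-bucket/orders/")`. Parquet files in Cloud Storage can then be loaded into BigQuery if needed.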
You can also use the Cloud Storage connector with Apache Spark; the link below may help.
Google Cloud Storage connector

Building Data Lake from scratch

I am trying to build a "Data Lake" from scratch. I understand how a data lake works and what it is for; that is all over the internet. But when it comes to how to build one from scratch, I can't find any source. I want to understand whether:
Data warehouse + Hadoop = Data Lake
I know how to run Hadoop and bring data into Hadoop.
I want to build a sample on-premises data lake to demo to my manager. Any help is appreciated.
You'd have to have structured and unstructured data to make a Hadoop cluster into a data lake.
So, you'd have to have some ETL pipeline taking the unstructured data and converting it to structured data. Product reviews or something similar would provide your unstructured data. Converting this to something usable by Hive (as an example) would give you your structured data.
I would look at https://opendata.stackexchange.com/ for getting your data and google Hadoop ETL for ideas on how to cleanse the data. It's up to you how you want to write your pipeline (Spark or MapReduce).
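As a toy illustration of the "unstructured to structured" step described above, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the review texts, the column layout, and the rule that a rating appears as "N stars" or "N/5". It turns free-text reviews into tab-separated rows, the kind of file a Hive text table could read, writing `\N` (Hive's default NULL marker for text files) when no rating is found.

```python
import re

# Matches "5 stars", "1 star" or "4/5"; purely illustrative.
RATING_RE = re.compile(r"\b([1-5])\s*(?:/\s*5|stars?\b)", re.IGNORECASE)


def structure_review(review_id, text):
    """Extract a few structured fields from one free-text review."""
    match = RATING_RE.search(text)
    words = text.split()
    return {
        "review_id": review_id,
        "rating": int(match.group(1)) if match else None,
        "word_count": len(words),
        # Collapsing whitespace also removes tabs and newlines,
        # which keeps the tab-separated output well formed.
        "text": " ".join(words),
    }


def to_tsv_line(row):
    """Render one structured review as a tab-separated line."""
    rating = "\\N" if row["rating"] is None else str(row["rating"])
    return "\t".join(
        [str(row["review_id"]), rating, str(row["word_count"]), row["text"]]
    )
```

In a real pipeline the same logic would run inside a Spark or MapReduce job over files already sitting in HDFS; the point is only that the raw text stays in the lake while a cleaned, tabular copy is produced next to it.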
You can also build a data lake using AWS services. A simple way to do so is to use an AWS CloudFormation template to configure the solution, including AWS services such as Amazon S3 for unlimited data storage, Amazon Cognito for authentication, Amazon Elasticsearch for strong search capabilities, AWS Lambda for microservices, AWS Glue for data transformation, and Amazon Athena for data analytics.
See this article for reference: https://medium.com/@pmahmoudzadeh/building-a-data-lake-on-aws-3f02f66a079e

Syncfusion Dashboard Platform: how to make dashboards fast?

I've set up Syncfusion Dashboard Platform to work with some imported test data (Syncfusion Big Data).
I used the Integration Platform to import the data into Hadoop (the files are stored in Avro format).
Then I loaded the Avro data from Hadoop into Hive tables via Spark SQL (converting Avro to tables); in total I have 400 rows.
Then I used this Hive2 instance as the data source in Dashboard Platform.
When I tried to fetch the data, it worked very slowly.
Can somebody point me to how to figure out why? The Syncfusion settings are the defaults.
Thanks!
Recommended solution:
HiveServer2 (MapReduce) is well suited to batch processing of large data sets, so I recommend using the Spark SQL data source for near-real-time analytics such as dashboard visualization. Spark SQL processes data in memory, which avoids multiple disk I/O operations.
Tables created under Hive can also be accessed from "Spark SQL" in the Syncfusion distribution, as both use the same metastore database.
So you can use the "Spark SQL" connection type in Syncfusion Dashboard Platform instead of "Hive".

Elasticsearch with Google BigQuery

I have event logs loaded in Elasticsearch and I visualise them using Kibana. The event logs are actually stored in a Google BigQuery table. Currently I dump the JSON files to a Google Cloud Storage bucket and download them to a local drive. Then, using Logstash, I move the JSON files from the local drive into Elasticsearch.
Now I am trying to automate the process by establishing a connection between Google BigQuery and Elasticsearch. From what I have read, there is an output connector that sends data from Elasticsearch to Google BigQuery, but not the other way around. I am wondering whether I should upload the JSON files to a Kubernetes cluster and then establish a connection between the cluster and Elasticsearch.
Any help in this regard would be appreciated.
Although it may be a little complex, I suggest using the Google Cloud Storage connector together with ES-Hadoop. Both are mature and used in production by many large companies.
Running Logstash across many pods on Kubernetes would be very expensive and, I think, not a very resilient or scalable approach.
Apache Beam has connectors for both BigQuery and Elasticsearch, so I would definitely do this with Dataflow; that way you don't need to implement a complex ETL process or staging storage. You can read the data from BigQuery using BigQueryIO.Read.from (if performance is important, look at the comparison of BigQueryIO read vs. fromQuery) and load it into Elasticsearch using ElasticsearchIO.write().
For an example of reading data from BigQuery in Dataflow, see:
https://github.com/GoogleCloudPlatform/professional-services/blob/master/examples/dataflow-bigquery-transpose/src/main/java/com/google/cloud/pso/pipeline/Pivot.java
For Elasticsearch indexing, see:
https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-elasticsearch-indexer
UPDATED 2019-06-24
Earlier this year the BigQuery Storage API was released. It improves the parallelism of data extraction from BigQuery and is natively supported by Dataflow. Refer to https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api for more details.
From the documentation:
The BigQuery Storage API allows you to directly access tables in BigQuery storage. As a result, your pipeline can read from BigQuery storage faster than previously possible.
I have recently worked on a similar pipeline. The workflow I would suggest is to use the Google Cloud Storage connector mentioned above, or another method, to read your JSON files into a Spark job. You should be able to transform your data quickly and easily, and then use the elasticsearch-spark plugin to load that data into your Elasticsearch cluster.
You can use Google Cloud Dataproc or Cloud Dataflow to run and schedule your job.
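Whichever engine runs the job, the load step comes down to turning exported rows into Elasticsearch bulk requests. Here is a minimal sketch in plain Python, assuming the rows arrive as newline-delimited JSON (the format of a BigQuery JSON export); the function name, the index name and the `event_id` field are made up for illustration.

```python
import json


def bulk_body(ndjson_lines, index, id_field=None):
    """Build an Elasticsearch _bulk request body from newline-delimited JSON rows."""
    out = []
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        doc = json.loads(line)
        action = {"index": {"_index": index}}
        if id_field is not None and id_field in doc:
            # A stable _id makes re-runs overwrite documents instead of duplicating them.
            action["index"]["_id"] = str(doc[id_field])
        out.append(json.dumps(action))
        out.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(out) + "\n" if out else ""
```

The returned string would be sent as the body of a POST to the cluster's `/_bulk` endpoint with the `Content-Type: application/x-ndjson` header, in batches of a few thousand rows. The ES-Hadoop and Beam connectors mentioned above do this batching for you, so this is mainly useful for understanding or for small volumes.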
As of 2021, there is a Dataflow template that allows a "GCP native" connection between BigQuery and Elasticsearch.
More information is available in a blog post by elastic.co.
Further documentation and a step-by-step process are provided by Google.
