Migrate data from snowflake to elasticsearch - elasticsearch

we are using snowflake data warehouse in my project, we would like to replace snowflake with Elasticsearch as part of project enhancement POC,
i don't found any solutions for moving data from snowflake to Elasticsearch.
can anyone help me to resolve the above concerns.
please share sufficient information, steps etc.
Thanks in advance
don't found any clues on data migration.

You can try to do it into 2 steps:
export data from Elastic to AWS S3 bucket
load data from AWS S3 bucket to snowflake.

You need to implement the migration at schema level. Moreover if you specify the question with the issues. It will be helpful to answer and guide you.

You can use COPY command to export data from Snowflake to a file that can then be loaded to another system. However I am curious to know why you are trying to replace Snowflake with Elasticsearch, as these are 2 different technologies, serving very different functions.

You can export your data from Snowflake S3 copy command.
Export in multiparts so your s3 bucket has small files.
Then you can hook a lambda on S3 PUT Object. So on each file upload a lambda will trigger.
You can write code in your Lambda to make rest calls to Elasticsearch.

Related

Loading data from S3 to Elasticsearch using AWS glue

I have muliple folders in a S3 bucket and each folder contains one JSON lines file.
I want to do two things with this data
Apply some transformations and get tabular data and save it to some database.
save these json objects, as it is to Elasticseach cluster for full text search
I am using AWS glue for this task and I know how to do 1, but, I can't find any resources that talks about getting data from s3 and storing it to elasticsearch using AWS glue.
Is there a way to do this?
If anyone is looking for an answer to this then I used Logstash to load files to Elasticsearch.

Loading data automatically from Oracle DB to Google BigQuery

Good day,
I have an Oracle DB and I need to load some tables so I can query them in BigQuery.
¿Is there a way of loading the data automatically, every 24 h, to Google BigQuery?
Any way would work. It could be loading into Data Storage and creating the tables from there, or loading into Google drive from the server.
I really need some ideas, I have read a lot of articles with no luck.
Check this tutorial by Progress:
https://www.progress.com/tutorials/cloud-and-hybrid/etl-on-premises-oracle-data-to-google-bigquery-using-google-cloud-dataflow
In this tutorial the main goal will be to connect to an On-Premises Oracle database, read the data, apply a simple transformation and write it to BigQuery. The code for this project has been uploaded to GitHub for your reference.
This solution uses Dataflow and Progress' Hybrid Data Pipeline tool:
Google Cloud Dataflow is a data processing service for both batch and real-time data streams. Dataflow allows you to build pipes to ingest data, then transform and process according to your needs before making that data available to analysis tools. DataDirect Hybrid Data Pipeline can be used to ingest both on-premises and cloud data with Google Cloud Dataflow.

Elastic search with Google Big Query

I have the event logs loaded in elasticsearch engine and I visualise it using Kibana. My event logs are actually stored in the Google Big Query table. Currently I am dumping the json files to a Google bucket and download it to a local drive. Then using logstash, I move the json files from the local drive to the elastic search engine.
Now, I am trying to automate the process by establishing the connection between google big query and elastic search. From what I have read, I understand that there is a output connector which sends the data from elastic search to Google big query but not vice versa. Just wondering whether I should upload the json file to a kubernete cluster and then establish the connection between the cluster and Elastic search engine.
Any help with this regard would be appreciated.
Although this solution may be a little complex, I suggest some solution that you use Google Storage Connector with ES-Hadoop. These two are very mature and used in production-grade by many great companies.
Logstash over a lot of pods on Kubernetes will be very expensive and - I think - not a very nice, resilient and scalable approach.
Apache Beam has connectors for BigQuery and Elastic Search, I would definitly perform this using DataFlow so you don´t need to implement a complex ETL and staging storage. You can read the data from BigQuery using BigQueryIO.Read.from (take a look to this if performance is important BigQueryIO Read vs fromQuery) and load it into ElasticSearch using ElasticsearchIO.write()
Refer this how read data from BigQuery Dataflow
https://github.com/GoogleCloudPlatform/professional-services/blob/master/examples/dataflow-bigquery-transpose/src/main/java/com/google/cloud/pso/pipeline/Pivot.java
Elastic Search indexing
https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-elasticsearch-indexer
UPDATED 2019-06-24
Recently this year was release BigQuery Storage API which improve the parallelism to extract data from BigQuery and is natively supported by DataFlow. Refer to https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api for more details.
From the documentation
The BigQuery Storage API allows you to directly access tables in BigQuery storage. As a result, your pipeline can read from BigQuery storage faster than previously possible.
I have recently worked on a similar pipeline. A workflow I would suggest would either use the mentioned Google storage connector, or other methods to read your json files into a spark job. You should be able to quickly and easily transform your data, and then use the elasticsearch-spark plugin to load that data into your Elasticsearch cluster.
You can use Google Cloud Dataproc or Cloud Dataflow to run and schedule your job.
As of 2021, there is a Dataflow template that allows a "GCP native" connection between BigQuery and ElasticSearch
More information here in a blog post by elastic.co
Further documentation and step by step process by google

Does BigQuery and Hadoop connector work for federated tables?

I'm following the example on:
https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example
I have a federated table on BigQuery. Would this be able to pull data from it?
The BigQuery connector doesn't currently have special logic for handling federated tables, so won't work correctly in that it would try to "export" to another GCS location. I've filed a GitHub issue to track this feature; in the meantime, if the federated data is indeed already in GCS, you should still be able to access it directly as a normal FileInputFormat (or sc.textFile), you just lose the schema/metadata benefits of going through BigQuery.

S3 to local hdfs data transfer using Sqoop

I have installed Apache hadoop on my local system and want to import data from Amazon s3 using Sqoop.
Is there any way to achive this.
If yes kindly help me how can i achieve this.
Examples would be much appreciated.
Please help me as soon as possible.
Note:I am not using Amazon EMR.
Sqoop is for getting data from relational databases only at the moment.
Try using "distcp" for getting data from S3.
The usage is documented here: http://wiki.apache.org/hadoop/AmazonS3 In the section "Running bulk copies in and out of S3"

Resources