Saving to elastic search from Google Dataflow streaming job - elasticsearch

I am saving data to BigQuery through a Google Dataflow Streaming job.
I want to insert this data into elastic search for rapid access.
Is it a good practice to call logstach from dataflow through http?

The Apache Beam Java SDK has a connector to read from/write to Elastic search. This should optimize the IO to be consistent with the Beam model.

Related

How to save data from spark to Google cloud platform?

I will extract the data from oracle database through Spark, and then I want to store this data from spark to any storage in Google cloud platform. Is it possible? Data size is around 10TB.
You can run Spark in GCP using Qubole. There are also "Data Connectors" available which will allow you to integrate with Oracle and other RDBMS systems.
A general flow could look like:
- Run a spark job using JDBC to read from Oracle
- Perform any necessary processing
- Write the data back out to GCS or BigQuery
Ref: https://www.qubole.com/blog/technical-overview-of-qubole-on-gcp/
and https://docs-gcp.qubole.com/
You can use cloud storage connector with apache-spark, here is the link through it which might help you can refer to it.
Google cloud connector

Loading data automatically from Oracle DB to Google BigQuery

Good day,
I have an Oracle DB and I need to load some tables so I can query them in BigQuery.
¿Is there a way of loading the data automatically, every 24 h, to Google BigQuery?
Any way would work. It could be loading into Data Storage and creating the tables from there, or loading into Google drive from the server.
I really need some ideas, I have read a lot of articles with no luck.
Check this tutorial by Progress:
https://www.progress.com/tutorials/cloud-and-hybrid/etl-on-premises-oracle-data-to-google-bigquery-using-google-cloud-dataflow
In this tutorial the main goal will be to connect to an On-Premises Oracle database, read the data, apply a simple transformation and write it to BigQuery. The code for this project has been uploaded to GitHub for your reference.
This solution uses Dataflow and Progress' Hybrid Data Pipeline tool:
Google Cloud Dataflow is a data processing service for both batch and real-time data streams. Dataflow allows you to build pipes to ingest data, then transform and process according to your needs before making that data available to analysis tools. DataDirect Hybrid Data Pipeline can be used to ingest both on-premises and cloud data with Google Cloud Dataflow.

How to index Azure storage data to elastic cloud

I am new to elastic search. I have data stored in Azure storage and I want to index it using elasticsearch. I have created a cluster at https://cloud.elastic.co. Do I need to create a service which will index the data in elastic cloud and then users can use/search this data using elastic search? How to index the data to elastic cloud using asp.net MVC?
Please suggest.
One way to approach this would be to write a console application that
pulls data from Azure storage using the Storage client in the WindowsAzure.Storage nuget package or similar
transforms data into documents according to your domain needs
bulk indexes documents into Elasticsearch in Elastic Cloud using the .NET Elasticsearch client NEST
If data will be updated in Azure storage and will need to be frequently indexed into Elasticsearch, consider making the console application an Azure Web Job.
Another approach would be to use Logstash in conjunction with the input plugin for Azure Storage blobs.

Elastic search with Google Big Query

I have the event logs loaded in elasticsearch engine and I visualise it using Kibana. My event logs are actually stored in the Google Big Query table. Currently I am dumping the json files to a Google bucket and download it to a local drive. Then using logstash, I move the json files from the local drive to the elastic search engine.
Now, I am trying to automate the process by establishing the connection between google big query and elastic search. From what I have read, I understand that there is a output connector which sends the data from elastic search to Google big query but not vice versa. Just wondering whether I should upload the json file to a kubernete cluster and then establish the connection between the cluster and Elastic search engine.
Any help with this regard would be appreciated.
Although this solution may be a little complex, I suggest some solution that you use Google Storage Connector with ES-Hadoop. These two are very mature and used in production-grade by many great companies.
Logstash over a lot of pods on Kubernetes will be very expensive and - I think - not a very nice, resilient and scalable approach.
Apache Beam has connectors for BigQuery and Elastic Search, I would definitly perform this using DataFlow so you don´t need to implement a complex ETL and staging storage. You can read the data from BigQuery using BigQueryIO.Read.from (take a look to this if performance is important BigQueryIO Read vs fromQuery) and load it into ElasticSearch using ElasticsearchIO.write()
Refer this how read data from BigQuery Dataflow
https://github.com/GoogleCloudPlatform/professional-services/blob/master/examples/dataflow-bigquery-transpose/src/main/java/com/google/cloud/pso/pipeline/Pivot.java
Elastic Search indexing
https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-elasticsearch-indexer
UPDATED 2019-06-24
Recently this year was release BigQuery Storage API which improve the parallelism to extract data from BigQuery and is natively supported by DataFlow. Refer to https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api for more details.
From the documentation
The BigQuery Storage API allows you to directly access tables in BigQuery storage. As a result, your pipeline can read from BigQuery storage faster than previously possible.
I have recently worked on a similar pipeline. A workflow I would suggest would either use the mentioned Google storage connector, or other methods to read your json files into a spark job. You should be able to quickly and easily transform your data, and then use the elasticsearch-spark plugin to load that data into your Elasticsearch cluster.
You can use Google Cloud Dataproc or Cloud Dataflow to run and schedule your job.
As of 2021, there is a Dataflow template that allows a "GCP native" connection between BigQuery and ElasticSearch
More information here in a blog post by elastic.co
Further documentation and step by step process by google

How to connect elasticsearch to apache spark streaming or storm?

We are building a real-time big data tool with open source tools. Our main goal is to supervise and analyze a network by getting logs from a kafka server in real-time. We saw in tutorials that we have to divide our tool in two sections: Analytic and Supervision as shown below.
For the supervision section we chose the solution Elasticsearch and Logstash.
Regarding the section analytic, my team and I are comparing Apache Storm Streaming and Apache Storm in order to use it with Elasticsearch. Despite the fact that Apache Storm is a true real-time data processing tool and faster than Apache Spark Streaming, it does not provide machine learning libraries like with Apache Spark. That's why we are thinking to choose Apache Spark. The elastic website indicates that it exists a connector ES-Hadoop to connect a Elasticsearch database to a Hadoop ecosystem. We can see that in the below figure.
However, We are a little bit confused with this picture because there is only spark SQL and not all the spark frameworks (MLlib, Spark Streaming..). We did some assumptions and we came out with two final possible architectures. We only wanted to know if there are technically correct and if we are not in the wrong direction.
With Apache Spark streaming:
With Apache Storm:
Both your architectural diagrams are ok. Keep on mind that spark streaming will not work in this scenario. Es-hadoop provides you with easy access apis to get and put data from and into elastic. Its also provides the methods to get the data inro the spark framework (RDD) or data frames inthe case of spark sql. Once the data is in the framework, all ml libraries can be applied to the data for ml or analytics generation. Elastic is not capable of streaming data so spark streaming in the strict sense is not possible. So in the diagram, the arrow to hdfs optional and then to spark streaming can be removed and the arrow juat pointa to hdfs. My concern, however, would be running mllib algos on the data in realtime and expect realtime performance. Typical use case might be do modwl generation off line and use the model in realtime for analysis.

Resources