Elasticsearch with Google BigQuery

I have event logs loaded into Elasticsearch and I visualise them using Kibana. The event logs are actually stored in a Google BigQuery table. Currently I dump the JSON files to a Google Cloud Storage bucket and download them to a local drive. Then, using Logstash, I move the JSON files from the local drive into Elasticsearch.
Now I am trying to automate the process by establishing a connection between Google BigQuery and Elasticsearch. From what I have read, I understand that there is an output connector which sends data from Elasticsearch to Google BigQuery, but not vice versa. I am wondering whether I should upload the JSON files to a Kubernetes cluster and then establish a connection between the cluster and Elasticsearch.
Any help in this regard would be appreciated.

Although this solution may be a little complex, I suggest that you use the Google Cloud Storage Connector together with ES-Hadoop. Both are very mature and used in production by many large companies.
Running Logstash across many pods on Kubernetes would be very expensive and, I think, not a particularly resilient or scalable approach.

Apache Beam has connectors for both BigQuery and Elasticsearch, so I would definitely do this with Dataflow; that way you don't need to implement a complex ETL with staging storage. You can read the data from BigQuery using BigQueryIO.Read.from (take a look at BigQueryIO Read vs fromQuery if performance is important) and load it into Elasticsearch using ElasticsearchIO.write().
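A minimal sketch of such a pipeline, assuming the Beam Java SDK with the beam-sdks-java-io-google-cloud-platform and beam-sdks-java-io-elasticsearch modules; the project, dataset, table, Elasticsearch address and index names are placeholders, and the exact ElasticsearchIO signatures vary slightly between Beam versions:

import com.google.api.services.bigquery.model.TableRow;
import com.google.gson.Gson;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BigQueryToElasticsearch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromBigQuery",
            BigQueryIO.readTableRows().from("my-project:my_dataset.event_logs"))
     // ElasticsearchIO expects JSON strings, so serialize each TableRow first.
     .apply("ToJson", MapElements.into(TypeDescriptors.strings())
            .via((TableRow row) -> new Gson().toJson(row)))
     .apply("WriteToElasticsearch",
            ElasticsearchIO.write().withConnectionConfiguration(
                ElasticsearchIO.ConnectionConfiguration.create(
                    new String[] {"http://10.0.0.5:9200"}, "event-logs", "_doc")));

    p.run();
  }
}

Run it with --runner=DataflowRunner and there is no intermediate dump to Cloud Storage or a local drive at all.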
See this example of reading data from BigQuery in Dataflow:
https://github.com/GoogleCloudPlatform/professional-services/blob/master/examples/dataflow-bigquery-transpose/src/main/java/com/google/cloud/pso/pipeline/Pivot.java
and this one for Elasticsearch indexing:
https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-elasticsearch-indexer
UPDATED 2019-06-24
Earlier this year the BigQuery Storage API was released; it improves the parallelism of extracting data from BigQuery and is natively supported by Dataflow. Refer to https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api for more details.
From the documentation:
The BigQuery Storage API allows you to directly access tables in BigQuery storage. As a result, your pipeline can read from BigQuery storage faster than previously possible.
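For what it's worth, switching the read in the sketch above over to the Storage API is a one-line change in recent Beam versions (the table name is again a placeholder):

PCollection<TableRow> rows = p.apply("ReadWithStorageApi",
    BigQueryIO.readTableRows()
        .from("my-project:my_dataset.event_logs")
        // DIRECT_READ tells Beam to pull rows through the BigQuery Storage API
        // instead of running an export job to Cloud Storage first.
        .withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ));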

I have recently worked on a similar pipeline. The workflow I would suggest either uses the Google Cloud Storage connector mentioned above, or some other method, to read your JSON files into a Spark job. You should be able to transform your data quickly and easily, and then use the elasticsearch-spark plugin to load it into your Elasticsearch cluster.
You can use Google Cloud Dataproc or Cloud Dataflow to run and schedule your job.
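A rough sketch of that job in Java, assuming the GCS connector and the elasticsearch-spark (es-hadoop) jars are on the classpath; the bucket path, Elasticsearch address and index name are placeholders:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class GcsJsonToElasticsearch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("gcs-json-to-es")
        .set("es.nodes", "10.0.0.5:9200");   // placeholder Elasticsearch node

    JavaSparkContext sc = new JavaSparkContext(conf);

    // Newline-delimited JSON, e.g. a BigQuery export sitting in a GCS bucket.
    JavaRDD<String> jsonLines = sc.textFile("gs://my-bucket/event-logs/*.json");

    // saveJsonToEs ships the raw JSON strings to the index without re-parsing them.
    JavaEsSpark.saveJsonToEs(jsonLines, "event-logs/_doc");

    sc.stop();
  }
}

On Dataproc you can submit this with gcloud dataproc jobs submit spark and schedule it alongside your other jobs.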

As of 2021, there is a Dataflow template that provides a "GCP native" connection between BigQuery and Elasticsearch.
More information is available in a blog post by elastic.co,
and Google provides further documentation with a step-by-step guide.

Related

How to take snapshots of specific indices in Elastic Cloud Enterprise?

In the Elastic Cloud UI, you can take snapshots/backups of your entire on-disk data and store them in a shared file system or an object store, say S3.
How do I back up only certain indices, instead of all of them, using the Elastic Cloud UI alone? Is there a way?
Only if that is not possible do I want to go with the APIs.
If you follow the Elasticsearch Service docs for Snapshot and Restore, you will see that we also link to the Elasticsearch Snapshot and Restore docs. There you will find instructions for backing up specific indices. You can use the API console to do this more easily from within the Elastic Cloud UI.
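For example, limiting a snapshot to particular indices is just a parameter on the snapshot API call. A hedged sketch using the Elasticsearch low-level Java REST client (the endpoint, repository name, snapshot name and index patterns are placeholders, and authentication headers are omitted):

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SnapshotSpecificIndices {
  public static void main(String[] args) throws Exception {
    RestClient client = RestClient.builder(
        new HttpHost("my-deployment.es.us-central1.gcp.cloud.es.io", 9243, "https")).build();

    // PUT _snapshot/<repository>/<snapshot> with an explicit "indices" list,
    // so only the named indices end up in the backup.
    Request request = new Request("PUT", "/_snapshot/my-repository/my-partial-snapshot");
    request.setJsonEntity(
        "{ \"indices\": \"logs-2019.06,metrics-*\", \"ignore_unavailable\": true }");

    Response response = client.performRequest(request);
    System.out.println(response.getStatusLine());
    client.close();
  }
}

The same request body can be pasted straight into the API console in the Elastic Cloud UI.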

Loading data automatically from Oracle DB to Google BigQuery

Good day,
I have an Oracle DB and I need to load some tables so I can query them in BigQuery.
Is there a way of loading the data automatically, every 24 hours, into Google BigQuery?
Any approach would work. It could be loading into Cloud Storage and creating the tables from there, or loading into Google Drive from the server.
I really need some ideas; I have read a lot of articles with no luck.
Check this tutorial by Progress:
https://www.progress.com/tutorials/cloud-and-hybrid/etl-on-premises-oracle-data-to-google-bigquery-using-google-cloud-dataflow
In this tutorial the main goal will be to connect to an On-Premises Oracle database, read the data, apply a simple transformation and write it to BigQuery. The code for this project has been uploaded to GitHub for your reference.
This solution uses Dataflow and Progress' Hybrid Data Pipeline tool:
Google Cloud Dataflow is a data processing service for both batch and real-time data streams. Dataflow allows you to build pipes to ingest data, then transform and process according to your needs before making that data available to analysis tools. DataDirect Hybrid Data Pipeline can be used to ingest both on-premises and cloud data with Google Cloud Dataflow.
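If you prefer to stay entirely inside Dataflow rather than use the Progress tool, a plain Beam pipeline with JdbcIO and the Oracle JDBC driver can do the same job. A hedged sketch (driver class, connection string, query, and project/dataset/table names are all placeholders, and the target BigQuery table is assumed to exist already):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class OracleToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromOracle", JdbcIO.<TableRow>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                    "oracle.jdbc.driver.OracleDriver",
                    "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1")  // placeholder connection string
                .withUsername("etl_user")
                .withPassword("secret"))
            .withQuery("SELECT id, name, updated_at FROM customers")
            .withCoder(TableRowJsonCoder.of())
            // Map each JDBC row to a BigQuery TableRow.
            .withRowMapper(rs -> new TableRow()
                .set("id", rs.getLong("id"))
                .set("name", rs.getString("name"))
                .set("updated_at", rs.getTimestamp("updated_at").toString())))
     .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.customers")
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

    p.run();
  }
}

For the every-24-hours requirement, one option is to wrap the pipeline as a Dataflow template (or a small launcher script) and trigger it once a day, for example from Cloud Scheduler or cron.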

How to build relational graph using elasticsearch data

We are building a log analytics application using Graylog and Elasticsearch. I have Elasticsearch installed, but I want to take the data from Elasticsearch and build relational graphs from it on my own, instead of using X-Pack Graph.
I could have used the X-Pack Graph API and made HTTP calls to get the data, but it is not freeware and I'm not sure we will be able to buy a licence.
Is there any free alternative to the X-Pack Graph API?
Or can I query Elasticsearch directly using aggregations? If so, how feasible is that? Can you share some resources on this?
Kindly share your thoughts on this.

Saving to elastic search from Google Dataflow streaming job

I am saving data to BigQuery through a Google Dataflow streaming job.
I want to insert this data into Elasticsearch for rapid access.
Is it good practice to call Logstash from Dataflow over HTTP?
The Apache Beam Java SDK has a connector to read from and write to Elasticsearch. This should optimize the I/O so that it is consistent with the Beam model.
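In other words, instead of calling Logstash over HTTP from your pipeline, you can write to Elasticsearch directly from the same streaming job that writes to BigQuery. A rough sketch, assuming the pipeline already has a PCollection<String> of JSON events called jsonDocs (the host and index are placeholders):

jsonDocs.apply("WriteToElasticsearch",
    ElasticsearchIO.write()
        .withConnectionConfiguration(
            ElasticsearchIO.ConnectionConfiguration.create(
                new String[] {"http://10.0.0.5:9200"}, "events", "_doc"))
        // Bulk the writes instead of issuing one HTTP request per element.
        .withMaxBatchSize(1000L));

The BigQuery write stays exactly as it is; the same PCollection simply gets a second sink.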

crawler + elasticsearch integration

I wasn't able to find out how to crawl a website and index the data into Elasticsearch. I managed to do that with the combination Nutch + Solr, and since Nutch should be able, from version 1.8, to export data directly to Elasticsearch (source), I tried to use Nutch again. Nevertheless, I didn't succeed. After trying to invoke
$ bin/nutch elasticindex
I get:
Error: Could not find or load main class elasticindex
I don't insist on using Nutch. I just need the simplest way to crawl websites and index them into Elasticsearch. The problem is that I wasn't able to find any step-by-step tutorial, and I'm quite new to these technologies.
So the question is: what would be the simplest way to integrate a crawler with Elasticsearch? If possible, I would be grateful for a step-by-step solution.
Did you have a look at the River Web plugin? https://github.com/codelibs/elasticsearch-river-web
It provides a good how-to section, covering creation of the required indexes, scheduling (based on Quartz), authentication (basic and NTLM are supported), metadata extraction, and more.
It might be worth having a look at the Elasticsearch river plugins overview as well: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html#river
Since the River plugins have been deprecated, it may be worth having a look at ManifoldCF or Norconex Collectors.
You can evaluate indexing Common Crawl metadata into Elasticsearch using Hadoop:
When working with large volumes of data, Hadoop provides all the power needed to parallelize the data ingestion.
Here is an example that uses Cascading to index directly into Elasticsearch:
http://blogs.aws.amazon.com/bigdata/post/TxC0CXZ3RPPK7O/Indexing-Common-Crawl-Metadata-on-Amazon-EMR-Using-Cascading-and-Elasticsearch
The process involves using a Hadoop cluster (EMR in this example) to run a Cascading application that indexes the JSON metadata directly into Elasticsearch.
The Cascading source code is also available, so you can see how the data ingestion into Elasticsearch is handled.
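If you would rather avoid Cascading, the same es-hadoop library also ships a plain MapReduce output format. A hedged sketch of a map-only job that pushes newline-delimited JSON documents straight into Elasticsearch (the node address, index name and input path are placeholders, not taken from the blog post):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class JsonToElasticsearchJob {

  // Pass each JSON line through unchanged; es-hadoop indexes it as one document.
  public static class PassThroughMapper extends Mapper<Object, Text, NullWritable, Text> {
    @Override
    protected void map(Object key, Text line, Context context)
        throws java.io.IOException, InterruptedException {
      context.write(NullWritable.get(), line);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("es.nodes", "10.0.0.5:9200");          // placeholder Elasticsearch node
    conf.set("es.resource", "crawl-metadata/_doc"); // target index/type
    conf.set("es.input.json", "yes");               // values are already JSON strings

    Job job = Job.getInstance(conf, "json-to-elasticsearch");
    job.setJarByClass(JsonToElasticsearchJob.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0);                       // map-only: documents go straight to ES
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(EsOutputFormat.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}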
