Does the BigQuery Hadoop connector work for federated tables? - hadoop

I'm following the example on:
https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example
I have a federated table on BigQuery. Would the connector be able to pull data from it?

The BigQuery connector doesn't currently have any special logic for handling federated tables, so it won't work correctly: it would try to "export" the table to another GCS location. I've filed a GitHub issue to track this feature. In the meantime, if the federated data is indeed already in GCS, you should still be able to access it directly as a normal FileInputFormat (or with sc.textFile); you just lose the schema/metadata benefits of going through BigQuery.
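For illustration, a minimal PySpark sketch under the assumption that the federated table is backed by newline-delimited JSON files in GCS (the bucket path is a placeholder, and the GCS connector is assumed to be configured, as it is by default on Dataproc):

```python
from pyspark.sql import SparkSession

# Minimal sketch: read the GCS files behind the federated table directly,
# bypassing the BigQuery connector. Placeholder bucket/path below.
spark = SparkSession.builder.appName("read-federated-source").getOrCreate()

# Option 1: raw text lines, equivalent to sc.textFile
lines = spark.sparkContext.textFile("gs://my-bucket/federated-data/*.json")
print(lines.take(5))

# Option 2: let Spark infer a schema from the JSON itself, since the
# BigQuery schema metadata is not available when bypassing the connector.
df = spark.read.json("gs://my-bucket/federated-data/*.json")
df.printSchema()
```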

Related

Migrate data from Snowflake to Elasticsearch

We are using the Snowflake data warehouse in my project, and we would like to replace Snowflake with Elasticsearch as part of a project-enhancement POC.
I haven't found any solutions for moving data from Snowflake to Elasticsearch.
Can anyone help me resolve this?
Please share relevant information, steps, etc.
Thanks in advance.
I haven't found any clues on data migration so far.
You can try to do it in 2 steps:
export data from Snowflake to an AWS S3 bucket
load data from the S3 bucket into Elasticsearch.
You will need to handle the migration at the schema level. Also, if you add the specific issues you are hitting to the question, it will be easier to answer and guide you.
You can use the COPY command to export data from Snowflake to files that can then be loaded into another system. However, I am curious to know why you are trying to replace Snowflake with Elasticsearch, as these are two different technologies serving very different functions.
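As an illustration, a hedged sketch of that export step using the Snowflake Python connector (the account details, stage name, and table name are placeholders):

```python
import snowflake.connector

# Placeholder credentials and object names -- adjust for your account.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)

# COPY INTO an external stage (assumed here to point at an S3 bucket),
# unloading JSON files that another system can pick up.
conn.cursor().execute("""
    COPY INTO @my_s3_stage/events/
    FROM my_events_table
    FILE_FORMAT = (TYPE = JSON)
    OVERWRITE = TRUE
""")
conn.close()
```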
You can export your data from Snowflake to S3 with the COPY command.
Export in multiple parts so your S3 bucket ends up with small files.
Then you can hook a Lambda onto the S3 PUT Object event, so a Lambda is triggered for each file upload.
In that Lambda you can write code that makes REST calls to Elasticsearch to index the data.
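A minimal sketch of such a Lambda handler, assuming the unloaded files are newline-delimited JSON and using the official elasticsearch Python client (the endpoint and index name are placeholders):

```python
import json
import boto3
from elasticsearch import Elasticsearch, helpers

s3 = boto3.client("s3")
# Placeholder endpoint -- point this at your Elasticsearch cluster.
es = Elasticsearch(["https://my-es-host:9200"])


def handler(event, context):
    """Triggered by S3 PUT events; bulk-indexes each uploaded file."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        # Assumes one JSON document per line in the unloaded file.
        actions = (
            {"_index": "snowflake-events", "_source": json.loads(line)}
            for line in body.decode("utf-8").splitlines()
            if line.strip()
        )
        helpers.bulk(es, actions)
```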

Loading data automatically from Oracle DB to Google BigQuery

Good day,
I have an Oracle DB and I need to load some tables so I can query them in BigQuery.
Is there a way of loading the data automatically, every 24 h, into Google BigQuery?
Any approach would work: it could be loading into Cloud Storage and creating the tables from there, or loading into Google Drive from the server.
I really need some ideas; I have read a lot of articles with no luck.
Check this tutorial by Progress:
https://www.progress.com/tutorials/cloud-and-hybrid/etl-on-premises-oracle-data-to-google-bigquery-using-google-cloud-dataflow
In this tutorial the main goal will be to connect to an On-Premises Oracle database, read the data, apply a simple transformation and write it to BigQuery. The code for this project has been uploaded to GitHub for your reference.
This solution uses Dataflow and Progress' Hybrid Data Pipeline tool:
Google Cloud Dataflow is a data processing service for both batch and real-time data streams. Dataflow allows you to build pipelines to ingest data, then transform and process it according to your needs before making that data available to analysis tools. DataDirect Hybrid Data Pipeline can be used to ingest both on-premises and cloud data with Google Cloud Dataflow.
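If a full Dataflow pipeline is more than you need, a simpler, purely illustrative alternative is a small Python job scheduled every 24 h (cron or Cloud Scheduler) that reads the Oracle table and loads it into BigQuery. The connection string and table names below are placeholders:

```python
import cx_Oracle
import pandas as pd
from google.cloud import bigquery

# Placeholder Oracle connection -- replace with your host/service and credentials.
oracle_conn = cx_Oracle.connect("scott", "tiger", "dbhost.example.com:1521/ORCLPDB1")

# Pull the table (or an incremental slice of it) into a DataFrame.
df = pd.read_sql("SELECT * FROM my_schema.my_table", con=oracle_conn)
oracle_conn.close()

# Load into BigQuery, replacing the destination table on each daily run.
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
job = client.load_table_from_dataframe(
    df, "my_project.my_dataset.my_table", job_config=job_config
)
job.result()  # wait for the load to finish
```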

What to use to read/write from DynamoDB from Spark?

I'd like to know what's best to use to read/write from DynamoDB from Spark.
I've tried the official DynamoDB API, the EMR connector (with Hadoop and also with Hive), and others.
But I've found (among other problems) that performing a query requires a full scan, which isn't viable with big tables.
Any suggestions, please?
The process you tried using the emr-dynamodb-connector is generally the way most people use it.
However, there are also third-party libraries you could use to connect to DynamoDB.
Generally, accessing DynamoDB from Spark is difficult because you end up tying Spark executors to DynamoDB's throughput throttling. One alternative you could try is HBase or Cassandra, which I found better supported for Spark usage (predicate pushdown, etc.).
Generally, the way I use DynamoDB data on a cluster with Spark is by utilizing DynamoDB Streams: collect the stream data in S3 and apply batch processing to that data.
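For completeness, a hedged PySpark sketch of the emr-dynamodb-connector approach mentioned above (it still scans the table in parallel rather than pushing queries down; the table name, region, and read-throughput percentage are placeholders, and the connector jar must be on the classpath, e.g. via --jars):

```python
from pyspark import SparkContext

sc = SparkContext(appName="dynamodb-read-sketch")

# Configuration for the emr-dynamodb-connector's Hadoop input format.
conf = {
    "dynamodb.servicename": "dynamodb",
    "dynamodb.input.tableName": "my_table",          # placeholder
    "dynamodb.regionid": "us-east-1",                # placeholder
    "dynamodb.endpoint": "dynamodb.us-east-1.amazonaws.com",
    "dynamodb.throughput.read.percent": "0.5",       # limit throttling impact
    "mapred.input.format.class": "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
}

# Each record comes back as a (Text, DynamoDBItemWritable) pair.
items = sc.hadoopRDD(
    inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
    conf=conf,
)
print(items.count())
```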

Elasticsearch with Google BigQuery

I have event logs loaded into Elasticsearch and I visualise them using Kibana. My event logs are actually stored in a Google BigQuery table. Currently I dump the JSON files into a Google Cloud Storage bucket and download them to a local drive. Then, using Logstash, I move the JSON files from the local drive into Elasticsearch.
Now I am trying to automate the process by establishing a connection between Google BigQuery and Elasticsearch. From what I have read, I understand that there is an output connector which sends data from Elasticsearch to Google BigQuery, but not vice versa. I'm wondering whether I should upload the JSON files to a Kubernetes cluster and then establish the connection between the cluster and Elasticsearch.
Any help in this regard would be appreciated.
Although this solution may be a little complex, I suggest using the Google Cloud Storage connector together with ES-Hadoop. Both are very mature and used in production by many large companies.
Running Logstash across a lot of pods on Kubernetes would be very expensive and, I think, not a very resilient or scalable approach.
Apache Beam has connectors for both BigQuery and Elasticsearch, so I would definitely do this with Dataflow; that way you don't need to implement a complex ETL with staging storage. You can read the data from BigQuery using BigQueryIO.Read.from (take a look at BigQueryIO Read vs fromQuery if performance is important) and load it into Elasticsearch using ElasticsearchIO.write().
Refer to this example of reading data from BigQuery in Dataflow:
https://github.com/GoogleCloudPlatform/professional-services/blob/master/examples/dataflow-bigquery-transpose/src/main/java/com/google/cloud/pso/pipeline/Pivot.java
Elasticsearch indexing:
https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-elasticsearch-indexer
UPDATE 2019-06-24
The BigQuery Storage API was released earlier this year; it improves the parallelism of extracting data from BigQuery and is natively supported by Dataflow. Refer to https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api for more details.
From the documentation
The BigQuery Storage API allows you to directly access tables in BigQuery storage. As a result, your pipeline can read from BigQuery storage faster than previously possible.
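If you are on the Python SDK, here is a rough sketch of that pipeline: it reads via the Storage API and writes to Elasticsearch from a DoFn, since ElasticsearchIO is a Java transform. The host, table, and index names are placeholders, and per-element indexing is kept deliberately simple:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class IndexToElasticsearch(beam.DoFn):
    """Writes each BigQuery row (a dict) into Elasticsearch."""

    def setup(self):
        from elasticsearch import Elasticsearch
        self.es = Elasticsearch(["https://my-es-host:9200"])  # placeholder

    def process(self, row):
        # elasticsearch-py 8.x; older clients use body= instead of document=.
        self.es.index(index="event-logs", document=row)  # placeholder index


def run():
    options = PipelineOptions(runner="DataflowRunner")  # plus project/region/etc.
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
                table="my-project:my_dataset.event_logs",             # placeholder
                method=beam.io.ReadFromBigQuery.Method.DIRECT_READ,   # Storage API
            )
            | "IndexToES" >> beam.ParDo(IndexToElasticsearch())
        )


if __name__ == "__main__":
    run()
```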
I have recently worked on a similar pipeline. The workflow I would suggest is to use either the aforementioned Google Cloud Storage connector, or another method, to read your JSON files into a Spark job. You should be able to quickly and easily transform your data, and then use the elasticsearch-spark plugin to load that data into your Elasticsearch cluster.
You can use Google Cloud Dataproc or Cloud Dataflow to run and schedule your job.
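A hedged sketch of that Spark approach (the GCS path, Elasticsearch host, and index name are placeholders; it assumes the elasticsearch-spark connector and the GCS connector are on the classpath):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bq-export-to-es")
    # Placeholder connection settings for the elasticsearch-spark connector.
    .config("es.nodes", "my-es-host")
    .config("es.port", "9200")
    .getOrCreate()
)

# Read the JSON files exported from BigQuery into GCS (placeholder path).
df = spark.read.json("gs://my-bucket/bq-exports/*.json")

# Apply any transformations you need, then index into Elasticsearch.
(
    df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.resource", "event-logs")   # target index (placeholder)
    .mode("append")
    .save()
)
```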
As of 2021, there is a Dataflow template that allows a "GCP native" connection between BigQuery and ElasticSearch
More information here in a blog post by elastic.co
Further documentation and a step-by-step guide by Google

What to use: Impala on HDFS, Impala on HBase, or just HBase?

I am working on a proof-of-concept task.
The task is to implement a feature of our product using Hadoop technology.
The feature is quite simple: we have a UI which lets you insert details about a "Network Issue".
All details about such an issue are captured and inserted into a table in an Oracle DB.
We then process the data in this table and calculate a Health Score.
I have to use Hadoop instead of a traditional DB, so my question is what to go for:
Impala on HDFS? or
Impala on Hbase ? or
Hbase?
I am using a Cloudera VM for the POC implementation.
As per my understanding, HBase is a NoSQL distributed database, which is actually a layer on top of HDFS and provides Java APIs to access the data.
Impala is a tool which also provides JDBC access to data, either over HBase or directly over HDFS.
I am very new to Hadoop; can someone please help?
Well, it depends on several things, like the kind of processing you are going to perform, the desired response time, etc. But from what you have written here, HBase seems to be fine. I don't see any need for Impala as of now. The HBase API is good and will serve most of your needs.
IMHO, it's better to keep things simple initially and add a tool only when it is really required. The same holds here: if you reach a point where you find that the HBase API is not able to serve the purpose, you can definitely add Impala to your stack.
That being said, there is one thing you should keep in mind. HBase is a NoSQL DB and doesn't follow RDBMS conventions and terminology, so you might find it a bit strange initially. Keep this in mind as you proceed, because you will have to design the schema in a way that is quite different from the RDBMS style of schema design.
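To make the "HBase API will serve most of your needs" point concrete, here is a minimal sketch using the Thrift-based happybase Python client (the native client is Java; the table, column family, host, and row values are placeholders):

```python
import happybase

# Assumes the HBase Thrift server is running (e.g. on the Cloudera VM).
connection = happybase.Connection("localhost")

# One-time setup: a table for network issues with a single column family.
if b"network_issues" not in connection.tables():
    connection.create_table("network_issues", {"details": dict()})

table = connection.table("network_issues")

# Insert one issue; the row key encodes something meaningful (here an issue id).
table.put(b"issue-0001", {
    b"details:severity": b"critical",
    b"details:description": b"Packet loss on core switch",
    b"details:health_score": b"42",
})

# Scan the rows to feed the health-score calculation.
for key, data in table.scan(row_prefix=b"issue-"):
    print(key, data)
```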
