How to save BigQuery query results to another table? - google-api

I want to save query results into a new table.
With a BigQuery online editor like bigquery.cloud.google I can easily do it with the micro-solution from Felipe Hoffa.
Results with ~150,000,000 rows are inserted within several seconds.
But how do I run a query with "Destination Table" parameters via the BigQuery API?

By using the Jobs.insert API call.
For example, in Java:
[...]
TableReference tableRef = new TableReference();
tableRef.setProjectId("<project>");
tableRef.setDatasetId("<dataset>");
tableRef.setTableId("<name>");

JobConfigurationQuery queryConfig = new JobConfigurationQuery();
queryConfig.setDestinationTable(tableRef);
queryConfig.setAllowLargeResults(true);
queryConfig.setQuery("some sql");
// The dispositions are plain strings in this API
queryConfig.setCreateDisposition("CREATE_IF_NEEDED");
queryConfig.setWriteDisposition("WRITE_APPEND");

JobConfiguration config = new JobConfiguration().setQuery(queryConfig);
Job job = new Job();
job.setConfiguration(config);

Bigquery.Jobs.Insert insert = bigquery.jobs().insert("<projectid>", job);
JobReference jobRef = insert.execute().getJobReference();
[...]
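Note that insert() returns as soon as the job is created; if you need to block until the destination table is populated, you can poll the job status. Below is a minimal sketch using the same client; the helper name and polling interval are mine, not part of the original answer.

// Illustrative helper: poll the query job until BigQuery reports it as DONE.
static void waitForJob(Bigquery bigquery, String projectId, JobReference jobRef) throws Exception {
    while (true) {
        Job polled = bigquery.jobs().get(projectId, jobRef.getJobId()).execute();
        if ("DONE".equals(polled.getStatus().getState())) {
            if (polled.getStatus().getErrorResult() != null) {
                throw new RuntimeException("Query failed: " + polled.getStatus().getErrorResult().getMessage());
            }
            return; // the destination table now holds the query results
        }
        Thread.sleep(1000); // back off between polls
    }
}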

Related

Hive to Elasticsearch ingestion issues

Using Elasticsearch version 6.8.0.
The complete Hive job fails because of a single malformed JSON record. I tried changing
'es.write.rest.error.handler.es.return.default' to 'PASS'/'HANDLED', but no luck.
Refer to: https://www.elastic.co/guide/en/elasticsearch/hadoop/master/errorhandlers.html
Below is the DDL script that is run at the Hive prompt for ingestion:
ADD JAR /home/smrafi/elasticsearch-hadoop-6.8.0/dist/elasticsearch-hadoop-6.8.0.jar;
CREATE external TABLE hive_es_with_handler10( data STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'test_eshadoop/healthCareProvider',
  'es.nodes' = 'xyzpqr',
  'es.input.json' = 'yes',
  'es.index.auto.create' = 'true',
  'es.write.operation' = 'upsert',
  'es.nodes.wan.only' = 'true',
  'es.port' = '443',
  'es.net.ssl' = 'true',
  'es.batch.size.entries' = '1',
  'es.mapping.id' = 'id',
  'es.batch.write.retry.count' = '-1',
  'es.batch.write.retry.wait' = '60s',
  'es.write.data.error.handlers' = 'es',
  'es.write.rest.error.handler.es.client.nodes' = 'vpc-pid-pre-prod-es-cluster-b7thvqfj3tp45arxl34gge3yyi.us-east-2.es.amazonaws.com',
  'es.write.rest.error.handler.es.client.port' = '443',
  'es.write.rest.error.handler.es.client.resource' = 'error_es_index',
  'es.write.rest.error.handler.es.return.default' = 'PASS',
  'es.write.rest.error.handler.es.return.error' = 'PASS');
insert into hive_es_with_handler10 select * from provider;
Below is the exception trace; it failed complaining that the error-handler index is not present:
Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Could not locate write resource for ES error handler.
at org.elasticsearch.hadoop.util.Assert.hasText(Assert.java:30)
at org.elasticsearch.hadoop.handler.impl.elasticsearch.ElasticsearchHandler.init(ElasticsearchHandler.java:145)
at org.elasticsearch.hadoop.serialization.handler.write.impl.DelegatingErrorHandler.init(DelegatingErrorHandler.java:40)
at org.elasticsearch.hadoop.handler.impl.AbstractHandlerLoader.loadHandlers(AbstractHandlerLoader.java:114)
at org.elasticsearch.hadoop.serialization.bulk.BulkEntryWriter.<init>(BulkEntryWriter.java:56)
at org.elasticsearch.hadoop.rest.RestRepository.lazyInitWriting(RestRepository.java:138)
at org.elasticsearch.hadoop.rest.RestRepository.writeProcessedToIndex(RestRepository.java:185)
at org.elasticsearch.hadoop.hive.EsHiveOutputFormat$EsHiveRecordWriter.write(EsHiveOutputFormat.java:64)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
... 9 more
Below is the configuration to properly collect all the bad-JSON-record errors. There are still issues with Hive, because Hive does not support malformed JSON records; please check this ElasticSearch hive SerializationError handler question.
ADD JAR /home/smrafi/elasticsearch-hadoop-6.8.0/dist/elasticsearch-hadoop-6.8.0.jar;
CREATE external TABLE hive_es_with_handler32( data STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'test_eshadoop/healthCareProvider',
  'es.nodes' = 'xyz',
  'es.input.json' = 'yes',
  'es.index.auto.create' = 'true',
  'es.write.operation' = 'upsert',
  'es.nodes.wan.only' = 'true',
  'es.port' = '443',
  'es.net.ssl' = 'true',
  'es.batch.size.entries' = '1',
  'es.mapping.id' = 'id',
  'es.batch.write.retry.count' = '-1',
  'es.batch.write.retry.wait' = '60s',
  'es.write.rest.error.handlers' = 'es, ignoreBadRecords',
  'es.write.data.error.handlers' = 'log, customLog, badJsonHandler',
  'es.write.data.error.handler.customLog' = 'com.xyz.elshandler.CustomLogOnError',
  'es.write.data.error.handler.badJsonHandler' = 'com.xyz.elshandler.BadJsonHandler',
  'es.write.rest.error.handler.es.client.resource' = 'error_es_index/error',
  'es.write.rest.error.handler.es.return.default' = 'HANDLED',
  'es.write.rest.error.handler.log.logger.name' = 'BulkErrors',
  'es.write.data.error.handler.log.logger.name' = 'SerializationErrors',
  'es.write.rest.error.handler.ignoreBadRecords' = 'com.xyz.elshandler.IgnoreBadRecordHandler',
  'es.write.rest.error.handler.es.return.error' = 'HANDLED');
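For context, the com.xyz.elshandler classes referenced in those TBLPROPERTIES are not shown in the question. The general shape of such a serialization error handler (per the errorhandlers documentation linked above) is roughly the sketch below; the class name, base class, and package imports are assumptions to be checked against your es-hadoop 6.8.0 jar, not code taken from the question:

package com.xyz.elshandler;

// Sketch only: verify the exact base class and packages against the es-hadoop 6.8 error-handler docs.
import org.elasticsearch.hadoop.handler.ErrorCollector;
import org.elasticsearch.hadoop.handler.HandlerResult;
import org.elasticsearch.hadoop.serialization.handler.write.SerializationErrorHandler;
import org.elasticsearch.hadoop.serialization.handler.write.SerializationFailure;

public class BadJsonHandler extends SerializationErrorHandler {
    @Override
    public HandlerResult onError(SerializationFailure entry, ErrorCollector<Object> collector) throws Exception {
        // Mark the malformed record as handled so one bad JSON document does not fail the whole Hive job
        return HandlerResult.HANDLED;
    }
}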

How to add a CSV in the ClickHouse JDBC library

This is in a Java project with the ClickHouse JDBC library:
<dependency>
    <groupId>com.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>0.3.2-patch11</version>
</dependency>
I want to try to insert a CSV through something like this:
String url = "jdbc:ch://localhost";
ClickHouseDataSource dataSource = new ClickHouseDataSource(url, new Properties());
ClickHouseConnection conn = dataSource.getConnection("default", "password");
ClickHousePreparedStatement clickHousePreparedStatement = (ClickHousePreparedStatement) conn.prepareStatement("INSERT INTO table FORMAT CSV < ?");
I have tried to simulate something like this with:
PreparedStatement ps = conn.prepareStatement("INSERT INTO table FORMAT CSV < ?");
ps.setString(1, "simulated csv rows separated with commas");
It works, but it is not what I'm looking for. I want to know whether the library has something specific for this; I have not found anything in the documentation.
How can I attach a CSV in the ClickHouse JDBC library?
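For comparison, here is a plain-JDBC fallback that streams a CSV file into a batched, parameterised insert. This is only a sketch and not a ClickHouse-specific CSV feature; the table name, column count, and file path are made up:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.PreparedStatement;

public class CsvBatchInsert {
    // Reads a two-column CSV line by line and inserts it via a JDBC batch.
    public static void insertCsv(Connection conn, String csvPath) throws Exception {
        String sql = "INSERT INTO my_table (col1, col2) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql);
             BufferedReader reader = Files.newBufferedReader(Paths.get(csvPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",", -1); // naive split; no quoted-field handling
                ps.setString(1, cols[0]);
                ps.setString(2, cols[1]);
                ps.addBatch();
            }
            ps.executeBatch(); // send the accumulated rows in one round trip
        }
    }
}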

Very slow connection to Snowflake from Databricks

I am trying to connect to Snowflake using R in Databricks. My connection works and I can make queries and retrieve data successfully; however, my problem is that it can take more than 25 minutes simply to connect, although once connected all my queries are quick thereafter.
I am using the sparklyr function 'spark_read_source', which looks like this:
query <- spark_read_source(
  sc = sc,
  name = "query_tbl",
  memory = FALSE,
  overwrite = TRUE,
  source = "snowflake",
  options = append(sf_options, client_Q)
)
where 'sf_options' is a list of connection parameters that looks similar to this:
sf_options <- list(
  sfUrl = "https://<my_account>.snowflakecomputing.com",
  sfUser = "<my_user>",
  sfPassword = "<my_pass>",
  sfDatabase = "<my_database>",
  sfSchema = "<my_schema>",
  sfWarehouse = "<my_warehouse>",
  sfRole = "<my_role>"
)
and my query is a string appended to the 'options' argument, e.g.
client_Q <- 'SELECT * FROM <my_database>.<my_schema>.<my_table>'
I can't understand why it is taking so long; if I run the same query from RStudio using a local Spark instance and 'dbGetQuery', it is instant.
Is spark_read_source the problem? Is it an issue between Snowflake and Databricks? Or something else? Any help would be great. Thanks.

Query HDFS with Spark SQL

I have a CSV file in HDFS; how can I query this file with Spark SQL? For example, I would like to run a select on specific columns and have the result stored back in the Hadoop distributed file system.
Thanks
You can achieve this by creating a DataFrame:
// A case class for the schema plus the implicits needed for .toDF()
case class Person(name: String, age: Int)
import spark.implicits._
val dataFrame = spark.sparkContext
  .textFile("examples/src/main/resources/people.csv")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()
// Register a temporary view so the data can be queried with SQL
dataFrame.createOrReplaceTempView("people")
spark.sql("<sql query>")
You should create a SparkSession. An example is here.
Load a CSV file: val df = sparkSession.read.csv("path to your file in HDFS").
Perform your select operation: val df2 = df.select("field1", "field2").
Write the results back: df2.write.csv("path to a new file in HDFS")
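If it helps to see those steps end to end, here is a rough equivalent using the Spark Java API (a sketch only; the paths, view name, and column names are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvSqlOnHdfs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("csv-sql-on-hdfs").getOrCreate();
        // Load the CSV from HDFS (header/inferSchema are optional conveniences)
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs:///path/to/input.csv");
        // Register a view and select the columns you need with SQL
        df.createOrReplaceTempView("people");
        Dataset<Row> result = spark.sql("SELECT field1, field2 FROM people");
        // Store the result back to HDFS
        result.write().csv("hdfs:///path/to/output");
        spark.stop();
    }
}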

Streaming to HBase with pyspark

There is a fair amount of info online about bulk loading to HBase with Spark streaming using Scala (these two were particularly useful) and some info for Java, but there seems to be a lack of info for doing it with PySpark. So my questions are:
How can data be bulk loaded into HBase using PySpark?
Most examples in any language only show a single column per row being upserted. How can I upsert multiple columns per row?
The code I currently have is as follows:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    context = SparkContext(appName="PythonHBaseBulkLoader")
    streamingContext = StreamingContext(context, 5)

    stream = streamingContext.textFileStream("file:///test/input")
    stream.foreachRDD(bulk_load)

    streamingContext.start()
    streamingContext.awaitTermination()
What I need help with is the bulk load function
def bulk_load(rdd):
    # ???
I've made some progress previously, with many and various errors (as documented here and here).
So after much trial and error, I present here the best I have come up with. It works well and successfully bulk loads data (using Puts or HFiles). I am perfectly willing to believe that it is not the best method, so any comments/other answers are welcome. This assumes you're using a CSV for your data.
Bulk loading with Puts
By far the easiest way to bulk load, this simply creates a Put request for each cell in the CSV and queues them up to HBase.
def bulk_load(rdd):
    # Your configuration will likely be different. Insert your own quorum, parent node and table name
    conf = {"hbase.zookeeper.quorum": "localhost:2181",
            "zookeeper.znode.parent": "/hbase-unsecure",
            "hbase.mapred.outputtable": "Test",
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
    load_rdd = (rdd
                .flatMap(lambda line: line.split("\n"))  # Split the input into individual lines
                .flatMap(csv_to_key_value))              # Convert each CSV line to key-value pairs
    load_rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
The function csv_to_key_value is where the magic happens:
def csv_to_key_value(row):
    cols = row.split(",")  # Split on commas.
    # Each cell is a tuple of (key, [key, column-family, column-descriptor, value])
    # Works well for n>=1 columns
    result = ((cols[0], [cols[0], "f1", "c1", cols[1]]),
              (cols[0], [cols[0], "f2", "c2", cols[2]]),
              (cols[0], [cols[0], "f3", "c3", cols[3]]))
    return result
The value converter we defined earlier will convert these tuples into HBase Puts.
Bulk loading with HFiles
Bulk loading with HFiles is more efficient: rather than a Put request for each cell, an HFile is written directly and the RegionServer is simply told to point to the new HFile. This will use Py4J, so before the Python code we have to write a small Java program:
import py4j.GatewayServer;
import org.apache.hadoop.hbase.*;

public class GatewayApplication {

    public static void main(String[] args)
    {
        GatewayApplication app = new GatewayApplication();
        GatewayServer server = new GatewayServer(app);
        server.start();
    }
}
Compile this, and run it. Leave it running as long as your streaming is happening. Now update bulk_load as follows:
from py4j.java_gateway import JavaGateway  # needed to reach the Java gateway started above

# startTime is assumed to be defined elsewhere (e.g. a per-batch timestamp string)
def bulk_load(rdd):
    # The output class changes, everything else stays
    conf = {"hbase.zookeeper.quorum": "localhost:2181",
            "zookeeper.znode.parent": "/hbase-unsecure",
            "hbase.mapred.outputtable": "Test",
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}  # "org.apache.hadoop.hbase.client.Put"
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
    load_rdd = (rdd
                .flatMap(lambda line: line.split("\n"))
                .flatMap(csv_to_key_value)
                .sortByKey(True))
    # Don't process empty RDDs
    if not load_rdd.isEmpty():
        # saveAsNewAPIHadoopDataset changes to saveAsNewAPIHadoopFile
        load_rdd.saveAsNewAPIHadoopFile("file:///tmp/hfiles" + startTime,
                                        "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
                                        conf=conf,
                                        keyConverter=keyConv,
                                        valueConverter=valueConv)
        # The file has now been written, but HBase doesn't know about it
        # Get a link to Py4J
        gateway = JavaGateway()
        # Convert conf to a fully fledged Configuration type
        config = dict_to_conf(conf)
        # Set up our HTable
        htable = gateway.jvm.org.apache.hadoop.hbase.client.HTable(config, "Test")
        # Set up our path
        path = gateway.jvm.org.apache.hadoop.fs.Path("/tmp/hfiles" + startTime)
        # Get a bulk loader
        loader = gateway.jvm.org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles(config)
        # Load the HFile
        loader.doBulkLoad(path, htable)
    else:
        print("Nothing to process")
Finally, the fairly straightforward dict_to_conf:
def dict_to_conf(conf):
    gateway = JavaGateway()
    config = gateway.jvm.org.apache.hadoop.conf.Configuration()
    for key, value in conf.items():  # copy every entry from the Python dict into the Hadoop Configuration
        config.set(key, value)
    return config
As you can see, bulk loading with HFiles is more complex than using Puts, but depending on your data load it is probably worth it since once you get it working it's not that difficult.
One last note on something that caught me off guard: HFiles expect the data they receive to be written in lexical order. This is not always guaranteed to be true, especially since "10" < "9". If you have designed your key to be unique, then this can be fixed easily:
load_rdd = (rdd
            .flatMap(lambda line: line.split("\n"))
            .flatMap(csv_to_key_value)
            .sortByKey(True))  # Sort in ascending order
