Big Query - Pentaho JDBC Simba Driver Connector - jdbc

I'm using the Simba JDBC driver to connect Pentaho to BigQuery, but the connection is really slow, even after adding parameters like:
EnableHighThroughputAPI=1;HighThroughputMinTableSize=1;rewriteBatchedStatements=true;useCompression=true;useServerPrepStmts=false;
I'm trying to use steps like Table output, Insert/Update, and Dimension lookup/update. Is there any other way to optimize the connector, or any other solution for connecting Pentaho to BigQuery?

Related

Formatting data to use Confluent JDBC Sink Connector via ksql

I'd like to use the Confluent JDBC Sink Connector via ksql to write to ClickHouse database.
I have a C# application that writes the data to a Kafka topic. How can I format the messages from my application so that the sink will accept them and write them to the database? I don't want to use the Schema Registry or other ksql constructs.
KSQL accepts JSON or CSV data. However, ClickHouse has its own Kafka connector, so you shouldn't need the JDBC Sink, which only works with messages that carry a schema (meaning you will need to use the Schema Registry, which is not only a KSQL construct and can be used from your C# code as well).

Connect Pentaho Data Integration with Oracle Autonomous Data Warehouse

I'm just getting started with this database. So far I have the following cases working:
Connecting with SQL Developer and importing data from a CSV
Connecting via JDBC from a Java file
The problem now is that I need Pentaho DI to connect to Oracle Autonomous Data Warehouse.
Does anybody know how to do it? Maybe using a JDBC connection?

Incremental fetch from Oracle

Is there any way to fetch incremental data from an Oracle database with a user-defined query over JDBC?
We are OK with using Spark, Kafka, or plain JDBC.
The only requirement is that it should be able to support heavy load.
You've not specified the destination. If it's a Kafka topic, then it makes sense to use Apache Kafka for the extract too, using Kafka Connect.
In that case, you can use the Kafka Connect JDBC connector to do this. See here for the specifics on using incremental mode with a custom query.
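As a rough illustration of incremental mode with a custom query, the sketch below assembles the settings a Kafka Connect JDBC source connector would typically need and prints them as properties-file lines. The connector name, host, service name, table, column names, and topic are all hypothetical placeholders. The config keys are those of the Confluent JDBC source connector as I understand them, so check them against the connector version you run.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: assembles Kafka Connect JDBC source settings for incremental polling.
final class IncrementalSourceConfigSketch {

    // All values come from the caller; nothing here talks to a database.
    static Map<String, String> build(String jdbcUrl, String query,
                                     String timestampColumn, String idColumn) {
        Map<String, String> cfg = new LinkedHashMap<>();
        cfg.put("name", "oracle-incremental-source"); // hypothetical connector name
        cfg.put("connector.class", "io.confluent.connect.jdbc.JdbcSourceConnector");
        cfg.put("connection.url", jdbcUrl);
        // The timestamp column detects updated rows; the id column breaks ties.
        cfg.put("mode", "timestamp+incrementing");
        cfg.put("timestamp.column.name", timestampColumn);
        cfg.put("incrementing.column.name", idColumn);
        // The connector appends its own WHERE clause, so the query carries none.
        cfg.put("query", query);
        // With a custom query, topic.prefix is used as the full topic name.
        cfg.put("topic.prefix", "oracle-orders"); // hypothetical topic
        cfg.put("poll.interval.ms", "5000");
        return cfg;
    }

    // Renders the settings as key=value lines for a properties file.
    static String render(Map<String, String> cfg) {
        StringBuilder sb = new StringBuilder();
        cfg.forEach((k, v) -> sb.append(k).append('=').append(v).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(build(
                "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1", // hypothetical host and service
                "SELECT * FROM ORDERS",                      // hypothetical table
                "UPDATED_AT", "ID")));                       // hypothetical columns
    }
}
```

If the user-defined query needs its own filter, wrapping it in a subselect is one way to keep the outer query free of a WHERE clause.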
++ EDIT ++
If your final target is BigQuery then you can use Kafka Connect for that too with the appropriate BigQuery connector. You can see an example of it in action here.

Kafka Topic to Oracle database using Kafka Connect API JDBC Sink Connector Example

I know how to write a Kafka consumer and insert/update each record into an Oracle database, but I want to leverage the Kafka Connect API and the JDBC Sink Connector for this purpose. Apart from the property file, I couldn't find in my search a complete executable example with detailed steps to configure and write the relevant Java code to consume a Kafka topic with JSON messages and insert/update (merge) a table in an Oracle database using the Kafka Connect API with the JDBC Sink Connector. Can someone demonstrate an example, including configuration and dependencies? Are there any disadvantages with this approach? Do we anticipate any potential issues when the table data grows to millions of rows?
Thanks in advance.
There won't be an example for your specific use case because the JDBC connector is meant to be generic.
Here is one configuration example with an Oracle database
All you need is:
A topic of some format
key.converter and value.converter set to deserialize that topic
Your JDBC string and database schema (tables, projection fields, etc.)
Any other JDBC Sink-specific options
All of this goes in a Java properties / JSON file, not Java source code.
If you have a specific issue creating this configuration, please comment.
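To make the checklist above concrete, here is a minimal sketch that prints one possible properties file for a JDBC sink writing to Oracle with upsert (merge) semantics. Java is used here only to generate the file; the connector itself is still configured purely through the resulting properties. The connector name, topic, URL, credentials, and key column are hypothetical, and the keys reflect the Confluent JDBC sink connector as I understand it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: prints a JDBC sink connector properties file for an Oracle target.
final class OracleSinkConfigSketch {

    static Map<String, String> build(String topic, String jdbcUrl, String user,
                                     String password, String keyColumn) {
        Map<String, String> cfg = new LinkedHashMap<>();
        cfg.put("name", "oracle-jdbc-sink"); // hypothetical connector name
        cfg.put("connector.class", "io.confluent.connect.jdbc.JdbcSinkConnector");
        cfg.put("tasks.max", "1");
        cfg.put("topics", topic);
        // Converters that deserialize the topic's keys and values.
        cfg.put("key.converter", "org.apache.kafka.connect.storage.StringConverter");
        cfg.put("value.converter", "org.apache.kafka.connect.json.JsonConverter");
        // With this enabled, each JSON value is expected to embed its schema.
        cfg.put("value.converter.schemas.enable", "true");
        cfg.put("connection.url", jdbcUrl);
        cfg.put("connection.user", user);
        cfg.put("connection.password", password);
        // Upsert keyed on a field of the record value gives insert/update behavior.
        cfg.put("insert.mode", "upsert");
        cfg.put("pk.mode", "record_value");
        cfg.put("pk.fields", keyColumn);
        cfg.put("auto.create", "false"); // assumes the target table already exists
        return cfg;
    }

    // Renders the settings as key=value lines for a properties file.
    static String render(Map<String, String> cfg) {
        StringBuilder sb = new StringBuilder();
        cfg.forEach((k, v) -> sb.append(k).append('=').append(v).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(build(
                "orders",                                    // hypothetical topic
                "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1", // hypothetical host and service
                "connect_user", "change-me",                 // placeholder credentials
                "ID")));                                     // hypothetical key column
    }
}
```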
Do we anticipate any potential issues when table data increases to millions?
Well, those issues would be database server related, not with Kafka Connect. For example, disk filling up or increased load while accepting continuous writes.
Are there any disadvantages with this approach?
You'd have to handle deduplication or record expiration (e.g. GDPR) separately, if you did want that.

JDBC connectivity from Airpal

Airpal currently uses the presto client to connect to PrestoDB. However, as I understand it, it can also use JDBC for this connectivity. Is there any code available for this purpose? Even code for connecting to some other database might be helpful to me. The model for the presto client looks quite different from other models such as JDBC.
Airpal is using presto client connectivity and also using these objects (mostly for schema and data like Column, QueryResults etc.) internally in its various modules.
One way to provide JDBC connectivity is to move its lowest layer of DB connectivity (the executeWith invocations of com.airbnb.airpal.core.execution.QueryClient: there is 1 for data and about 6 for metadata) to JDBC query execution. The JDBC results (mostly data and schema) can then be converted to the equivalent presto client API objects, and the rest of the logic in Airpal would follow.
Another approach is to rewrite Airpal with native JDBC support by moving over to JDBC objects for internal use and communication as well. It looks like a much bigger change.
I am planning to add support for dynamically choosing between presto client and JDBC connectivity. I will use com.airbnb.airpal.presto.QueryRunner to hold either a presto client session or a JDBC connection accordingly.
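The "hold either a presto client session or a JDBC connection" idea could look roughly like the sketch below. It is not Airpal's code: PrestoSessionStub, QueryRunnerSketch, and the execute method are hypothetical names made up for illustration, and the real QueryRunner and presto client session classes will differ.

```java
import java.sql.Connection;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical stand-in for the presto client session type; not Airpal's real class.
final class PrestoSessionStub {
    final String server;

    PrestoSessionStub(String server) {
        this.server = Objects.requireNonNull(server);
    }
}

// Sketch of a runner that holds exactly one of the two connection kinds.
final class QueryRunnerSketch {
    enum Backend { PRESTO_CLIENT, JDBC }

    private final Backend backend;
    private final PrestoSessionStub prestoSession; // set only for PRESTO_CLIENT
    private final Connection jdbcConnection;       // set only for JDBC

    private QueryRunnerSketch(Backend backend, PrestoSessionStub session, Connection connection) {
        this.backend = backend;
        this.prestoSession = session;
        this.jdbcConnection = connection;
    }

    static QueryRunnerSketch forPresto(PrestoSessionStub session) {
        return new QueryRunnerSketch(Backend.PRESTO_CLIENT, Objects.requireNonNull(session), null);
    }

    static QueryRunnerSketch forJdbc(Connection connection) {
        return new QueryRunnerSketch(Backend.JDBC, null, Objects.requireNonNull(connection));
    }

    Backend backend() {
        return backend;
    }

    // Routes the call to whichever handler matches the configured backend.
    <T> T execute(Function<PrestoSessionStub, T> onPresto, Function<Connection, T> onJdbc) {
        return backend == Backend.PRESTO_CLIENT
                ? onPresto.apply(prestoSession)
                : onJdbc.apply(jdbcConnection);
    }

    public static void main(String[] args) {
        QueryRunnerSketch runner = forPresto(new PrestoSessionStub("http://presto.example:8080"));
        String description = runner.execute(s -> "presto client via " + s.server, c -> "jdbc");
        System.out.println(description);
    }
}
```

Keeping the choice behind one dispatch method means callers never check which backend is active. The JDBC handler is then the natural place to convert result sets into the presto client equivalents that the rest of Airpal expects.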
