I need to select millions of rows from a MySQL database using the JdbcIO API. I am using JdbcIO version 2.7.0.
Sample code:
pipeline.apply(JdbcIO.read()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
        .create(jdbcProperties.getProperty("driver"),
                jdbcProperties.getProperty("url"))
        .withUsername(jdbcProperties.getProperty("username"))
        .withPassword(jdbcProperties.getProperty("password")))
    .withQuery(query.toString())
The query is slow because the JDBC read runs as a single-threaded operation at the worker node level.
I checked the source code for the readAll() method, and it does not support parallel read/write operations either.
By contrast, Spark has the following properties to parallelize the JDBC read/write operations - numPartitions, partitionColumn, lowerBound, upperBound.
Is there a similar option in the JdbcIO API to partition the query and parallelize the JDBC read/write operations, so that bulk data retrieved from the database is processed in parallel?
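To make the Spark-style partitioning concrete, here is a minimal sketch of how a numeric column range can be turned into one query per partition. It is plain Python, not part of the JdbcIO API; the function names, the table name `orders`, and the column `id` are hypothetical. As in Spark, the bounds only decide the stride: the first and last ranges are left open so that rows outside the bounds are still read.

```python
from typing import List


def partition_predicates(column: str, lower: int, upper: int,
                         num_partitions: int) -> List[str]:
    """Return one WHERE predicate per partition covering the whole column range."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    if num_partitions == 1 or upper <= lower:
        return ["1 = 1"]  # single partition: no range restriction
    stride = max((upper - lower) // num_partitions, 1)
    predicates = []
    start = lower
    for i in range(num_partitions):
        end = start + stride
        if i == 0:
            # First partition is open at the bottom and also picks up NULLs.
            predicates.append(f"{column} < {end} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open at the top.
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {end}")
        start = end
    return predicates


def partitioned_queries(table: str, column: str, lower: int, upper: int,
                        num_partitions: int) -> List[str]:
    """Build one SELECT per partition; each can be run on its own connection."""
    return [f"SELECT * FROM {table} WHERE {p}"
            for p in partition_predicates(column, lower, upper, num_partitions)]


if __name__ == "__main__":
    for q in partitioned_queries("orders", "id", 0, 100, 4):
        print(q)
```

Each generated query reads a disjoint slice, so the slices can be fetched by separate workers. This only helps if the partition column is indexed and its values are reasonably evenly distributed.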
Suppose I fetch data from Oracle into a Spark DataFrame as shown below.
Will the query run entirely in Oracle? Assume the query is huge. Does that put too much load on Oracle? Would a better approach be to read each filtered table into a separate DataFrame and join them using Spark SQL or the DataFrame API, so that the whole join happens in Spark? Can you please help with this?
df = sqlContext.read.format('jdbc').options(
    url="jdbc:mysql://foo.com:1111",
    dbtable="(SELECT * FROM abc,bcd.... where abc.id= bcd.id.....) AS table1",
    user="test",
    password="******",
    driver="com.mysql.jdbc.Driver").load()
In general, actual data movement is the most time-consuming part and should be avoided. So, as a general rule, you want to filter as much as possible in the JDBC source (Oracle in your case) before the data is moved into your Spark environment.
Once you're ready to do some analysis in Spark, you can persist (cache) the result so as to avoid re-retrieving from Oracle every time.
That being said, #shrey-jakhmola is right: you should benchmark for your particular circumstances. Is the Oracle environment choked somehow, perhaps?
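The difference in data movement can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite database as a stand-in for the JDBC source, and the table contents and the `region` and `amount` columns are made up for the demonstration. Option A joins and filters in the database and moves only the result; option B moves both tables and joins on the client.

```python
import sqlite3

# In-memory SQLite stands in for the remote JDBC source (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE abc (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE bcd (id INTEGER PRIMARY KEY, amount INTEGER);
""")
conn.executemany("INSERT INTO abc VALUES (?, ?)",
                 [(i, "EU" if i % 10 == 0 else "US") for i in range(1000)])
conn.executemany("INSERT INTO bcd VALUES (?, ?)",
                 [(i, i * 2) for i in range(1000)])

# Option A: join and filter in the database, move only the result.
pushed_down = conn.execute(
    "SELECT abc.id, bcd.amount FROM abc JOIN bcd ON abc.id = bcd.id "
    "WHERE abc.region = 'EU'"
).fetchall()

# Option B: move both tables in full, then join and filter on the client.
all_abc = conn.execute("SELECT id, region FROM abc").fetchall()
all_bcd = dict(conn.execute("SELECT id, amount FROM bcd").fetchall())
client_side = [(i, all_bcd[i]) for i, region in all_abc
               if region == "EU" and i in all_bcd]

rows_moved_a = len(pushed_down)             # rows transferred with pushdown
rows_moved_b = len(all_abc) + len(all_bcd)  # rows transferred without pushdown
print(rows_moved_a, rows_moved_b)
```

Both options produce the same result set; only the number of rows crossing the connection differs. Whether the database can afford to run the join is exactly what the benchmark should tell you.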
I am designing a scalable service (Spring Boot) that indexes data into Elasticsearch.
Use case:
My application uses 6 MySQL databases with the same schema. Each database serves a specific region. I have a microservice that connects to all of these databases and indexes data from specific tables into an Elasticsearch server (v6.8.8), with 6 Elasticsearch indexes, one for each database.
Quartz jobs and RestHighLevelClient are used for this purpose. There are also delta jobs running every second that look for changes using audit and indexes.
Current problem:
The current design is not scalable: one service does all the work (data loading, mapping, bulk upsert). Because indexing is done through Quartz jobs, scaling the service (running multiple instances) will run the same job multiple times.
No failover: I am looking at running distributed Elasticsearch nodes and indexing data to both nodes. How can this be done efficiently?
I am considering Spring Data Elasticsearch to index data at the same time as it is persisted to the database.
Does it offer all the features I need? I use:
Elasticsearch administration, from installing templates to creating and deleting indexes and aliases.
Blue/green deployment: index to the non-active nodes and then switch the aliases.
Bulk upsert, querying, aggregations, etc.
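For reference, the alias switch in a blue/green deployment comes down to a single request body for the Elasticsearch `_aliases` endpoint, which applies its actions atomically. The sketch below only builds that body; the helper name and the index and alias names are hypothetical, and sending the request (for example through RestHighLevelClient) is left out.

```python
import json
from typing import Dict


def alias_swap_actions(alias: str, old_index: str, new_index: str) -> Dict:
    """Build the body for POST /_aliases that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: one alias per region, pointing at a blue or a green index.
body = alias_swap_actions("customers_eu", "customers_eu_blue", "customers_eu_green")
print(json.dumps(body, indent=2))
```

Because both actions travel in one request, readers of the alias never see a moment where it points at no index or at both.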
Any other solutions are welcome. Thanks for your time.
One of your use cases is to move data from a database (MySQL) to Elasticsearch in a scalable manner. This is basically a CDC (change data capture) pipeline.
You can use the Kafka Connect framework for this.
The flow should be like:
Read the MySQL transaction logs => publish the data to Kafka (this can be accomplished using the Debezium source connector)
Consume data from Kafka => push it to Elasticsearch (this can be accomplished using the Elasticsearch sink connector)
Why use the framework?
With the Connect framework, data can be read directly from the MySQL transaction logs without writing code.
The Connect framework is a distributed and scalable system.
It reduces the load on your database, because you no longer need to query the database to detect changes.
It is easy to set up.
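In practice the sink connector does the mapping for you, but the idea behind the second step can be sketched in a few lines. The example below turns simplified change events into the newline-delimited lines of an Elasticsearch bulk request. It assumes the Debezium envelope fields `op`, `before`, and `after` (depending on the converter, real messages may wrap these in a `payload` field); the function names, the index name, and the `id` primary key are hypothetical. The action metadata omits `_type`, which Elasticsearch 6.x expects either here or in the request URL.

```python
import json
from typing import Dict, Iterable, List


def to_bulk_lines(events: Iterable[Dict], index: str, id_field: str = "id") -> List[str]:
    """Map change events to bulk API lines: upserts for c/u/r, deletes for d."""
    lines = []
    for event in events:
        op = event["op"]
        if op in ("c", "u", "r"):  # create, update, snapshot read
            row = event["after"]
            lines.append(json.dumps({"update": {"_index": index, "_id": str(row[id_field])}}))
            lines.append(json.dumps({"doc": row, "doc_as_upsert": True}))
        elif op == "d":  # delete: only the "before" image carries the key
            row = event["before"]
            lines.append(json.dumps({"delete": {"_index": index, "_id": str(row[id_field])}}))
    return lines


def to_bulk_body(events: Iterable[Dict], index: str) -> str:
    """Join the lines into a bulk request body (it must end with a newline)."""
    return "\n".join(to_bulk_lines(events, index)) + "\n"


sample_events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Alice"}},
    {"op": "u", "before": {"id": 1, "name": "Alice"}, "after": {"id": 1, "name": "Alicia"}},
    {"op": "d", "before": {"id": 2, "name": "Bob"}, "after": None},
]
bulk_lines = to_bulk_lines(sample_events, "customers_eu")
print(to_bulk_body(sample_events, "customers_eu"))
```

Since every write is an upsert keyed by the primary key, replaying the same events after a failure leaves the index in the same state.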
What is the best way to ingest data from a Teradata database into Hadoop in parallel?
If we create a job that simply opens one session to the Teradata database, it will take a long time to load a huge table.
If we create a set of sessions to load data in parallel and run a SELECT in each session, then each one will trigger a full table scan in Teradata to produce its data.
What is the recommended best practice for loading data in parallel streams without putting unnecessary workload on Teradata?
If Teradata supports table partitioning like Oracle does, you could try reading the table based on partition boundaries, which enables parallel reads.
Another option is to split the table into multiple portions yourself, for example by adding a where clause on an indexed column to each query. This ensures an index scan and avoids a full table scan.
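A minimal sketch of the second option follows. It uses a temporary SQLite file as a stand-in for Teradata, and the table name `big_table`, its columns, and the helper names are hypothetical. Each worker opens its own session and reads a disjoint range of the indexed key, so no row is read twice. The sketch assumes the key values are dense between 0 and n_rows - 1; with gaps or skew, the ranges would need to come from the actual key distribution.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def build_demo_db(path: str, n_rows: int) -> None:
    """Create a demo table with an indexed integer key (the primary key)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO big_table VALUES (?, ?)",
                     ((i, f"row-{i}") for i in range(n_rows)))
    conn.commit()
    conn.close()


def read_range(path: str, lo: int, hi: int):
    """Read one half-open key range [lo, hi) on a dedicated connection."""
    conn = sqlite3.connect(path)  # one session per worker
    try:
        return conn.execute(
            "SELECT id, payload FROM big_table WHERE id >= ? AND id < ? ORDER BY id",
            (lo, hi),
        ).fetchall()
    finally:
        conn.close()


def parallel_read(path: str, n_rows: int, sessions: int):
    """Split the key space into contiguous ranges and read them concurrently."""
    step = -(-n_rows // sessions)  # ceiling division
    ranges = [(lo, min(lo + step, n_rows)) for lo in range(0, n_rows, step)]
    with ThreadPoolExecutor(max_workers=sessions) as pool:
        chunks = pool.map(lambda r: read_range(path, *r), ranges)
        return [row for chunk in chunks for row in chunk]


db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
build_demo_db(db_path, 1000)
rows = parallel_read(db_path, 1000, 4)
print(len(rows))
```

The range predicate on the key is what lets the database use the index instead of scanning the whole table once per session.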
The most scalable way I have found to ingest data into Hadoop from Teradata is to use the Teradata Connector for Hadoop. It is included in the Cloudera and Hortonworks distributions. I will show an example based on the Cloudera documentation, but the same works with Hortonworks as well:
Informatica Big Data Edition uses a standard Sqoop invocation via the command line and passes a set of parameters to it. So the main question is which driver to use to make parallel connections between the two MPP systems.
Here is the link to the Cloudera documentation:
Using the Cloudera Connector Powered by Teradata
And here is a digest of that documentation (you will see that this connector supports different kinds of load balancing between connections):
Cloudera Connector Powered by Teradata supports the following methods for importing data from Teradata to Hadoop:
split.by.amp
split.by.value
split.by.partition
split.by.hash
split.by.amp Method
This optimal method retrieves data from Teradata. The connector creates one mapper per available Teradata AMP, and each mapper subsequently retrieves data from each AMP. As a result, no staging table is required. This method requires Teradata 14.10 or higher.
If you use partition names in the select clause, PowerCenter will select only the rows within that partition, so there won't be duplicate reads (don't forget to choose Database partitioning at the Informatica session level). However, if you use key range partitioning, you have to choose the ranges in the settings, as you mentioned. We usually use the Oracle NTILE analytic function to split the table into multiple portions so that each select reads a unique set of rows. If the table has a range, auto-generated, or surrogate key column, use it in the where clause and write a sub-query to divide the table into multiple portions. Please let me know if you have any questions.
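The NTILE approach can be sketched as follows. The example uses SQLite (which has supported window functions since version 3.25) in place of Oracle, and the table `orders` with its surrogate key `order_id` is hypothetical. NTILE assigns each row to one of N buckets of nearly equal size; the minimum and maximum key per bucket then become the range for one reader.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO orders VALUES (?)", [(i,) for i in range(1, 11)])

# Bucket the keys into 3 portions and take the key range of each portion.
boundaries = conn.execute("""
    SELECT bucket, MIN(order_id), MAX(order_id)
    FROM (SELECT order_id, NTILE(3) OVER (ORDER BY order_id) AS bucket FROM orders)
    GROUP BY bucket
    ORDER BY bucket
""").fetchall()

# One non-overlapping query per portion, each suitable for its own session.
queries = [
    f"SELECT * FROM orders WHERE order_id BETWEEN {lo} AND {hi}"
    for _, lo, hi in boundaries
]
for q in queries:
    print(q)
```

Unlike fixed-width ranges, the buckets stay balanced even when the key has gaps, at the cost of one extra pass over the key column to compute the boundaries.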
I have a reporting framework that builds and generates reports in tabular format. Until now I have written SQL queries that fetch the data from Oracle. Now I have an interesting challenge: half of the data will come from Oracle, and the rest will come from MongoDB based on the output of the Oracle query. The tabular data fetched from Oracle will have one additional column containing the key used to fetch data from MongoDB. This gives me two tabular data sets, one from Oracle and one from MongoDB. I need to merge the two on one common column and produce a single data set for the report.
I can write the logic in Java to merge the two tables (say, with the data in 2D arrays). But instead of doing this on my own, I am thinking of using an in-memory RDBMS. For example, with the H2 database I could create two in-memory tables on the fly and run H2 queries to merge them. Or, I believe, there could be something in Oracle too, such as a global temporary table. Could someone please suggest a better approach for joining Oracle table data with a MongoDB collection?
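The in-memory database idea can be sketched as below. The example uses Python's built-in SQLite as a stand-in for H2, and the column names, sample rows, and function name are invented for illustration. Both result sets are loaded into temporary tables and merged with one SQL join on the common key column.

```python
import sqlite3
from typing import Dict, List, Tuple

# Hypothetical inputs: rows already fetched from Oracle, documents from MongoDB.
oracle_rows = [(1, "Alice", "k1"), (2, "Bob", "k2"), (3, "Carol", "k3")]
mongo_docs = [{"_id": "k1", "score": 90}, {"_id": "k2", "score": 75}]


def merge_for_report(oracle_rows: List[Tuple], mongo_docs: List[Dict]) -> List[Tuple]:
    """Load both data sets into in-memory tables and join them on the common key."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE oracle_part (id INTEGER, name TEXT, mongo_key TEXT)")
        conn.execute("CREATE TABLE mongo_part (mongo_key TEXT PRIMARY KEY, score INTEGER)")
        conn.executemany("INSERT INTO oracle_part VALUES (?, ?, ?)", oracle_rows)
        conn.executemany("INSERT INTO mongo_part VALUES (?, ?)",
                         [(d["_id"], d["score"]) for d in mongo_docs])
        # LEFT JOIN keeps Oracle rows that have no matching MongoDB document.
        return conn.execute("""
            SELECT o.id, o.name, m.score
            FROM oracle_part o
            LEFT JOIN mongo_part m ON m.mongo_key = o.mongo_key
            ORDER BY o.id
        """).fetchall()
    finally:
        conn.close()


report = merge_for_report(oracle_rows, mongo_docs)
print(report)
```

This keeps the merge logic declarative, and it works well as long as both result sets fit in memory. For very large reports, a hash join in application code or a federated query engine may be a better fit.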
I think you can try using Kafka and Spark Streaming to solve this problem. Assuming your data is transactional, you can set up a Kafka broker and create a topic. Then change the existing services at the points where they save to Oracle and MongoDB: create 2 Kafka producers (one for Oracle and another for MongoDB) that write the data as streams to the Kafka topic. Then create a consumer group to receive the streams from Kafka. You can then aggregate the real-time streams using a Spark cluster (see the Spark Streaming API for Kafka) and save the results back to MongoDB (using the Spark Connector from MongoDB) or any other distributed database. Finally, you can do data visualization and reporting on the results stored in MongoDB.
Another suggestion would be to use Apache Drill. https://drill.apache.org
You can configure the MongoDB and JDBC storage plugins and then join Oracle tables and MongoDB collections together.
I have a significant amount of data stored on my Hadoop HDFS as Parquet files.
I am using Spark streaming to interactively receive queries from a web server and transform the received queries into SQL to run on my data using SparkSQL.
In this process I need to run several SQL queries and then return some aggregate result by merging or subtracting the results of individual queries.
Are there any ways I could optimize and increase the speed of the process by, for example, running queries on already received dataframes rather than the whole database?
Is there a better way to interactively query the Parquet stored data and give results?
Thank you!
If you are running multiple queries on the same RDD you will get a performance increase by caching the RDD with .cache() before querying it.
Also, are you sure that Apache Spark is the right tool for the job here? For the interactive queries you are describing, Impala or Presto may be more suitable.