How to increase the performance of inserting data from MongoDB to Greenplum with PDI (Kettle)?

I use PDI (Kettle) to extract data from MongoDB into Greenplum. When I tested extracting the data from MongoDB to a file, it was fast, about 10,000 rows per second. But when extracting into Greenplum, it is only about 130 rows per second.
I modified the following Greenplum parameters, but there was no significant improvement.
gpconfig -c log_statement -v none
gpconfig -c gp_enable_global_deadlock_detector -v on
And if I increase the number of copies of the Table Output step, the transformation seems to hang and no data is inserted for a long time. I don't know why.
How can I increase the performance of inserting data from MongoDB into Greenplum with PDI (Kettle)?
Thank you.

There are a variety of factors that could be at play here.
Is PDI loading via an ODBC or JDBC connection?
What is the size of data? (row count doesn't really tell us much)
What is the size of your Greenplum cluster (# of hosts and # of segments per host)?
Is the table you are loading into indexed?
What is the network connectivity between Mongo and Greenplum?
The best bulk load performance using data integration tools such as PDI, Informatica PowerCenter, IBM DataStage, etc., will be achieved using Greenplum's native bulk loading utilities, gpfdist and gpload.
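As an illustrative sketch only (the host names, port, file paths, and target table below are assumptions, not from the question): gpload wraps this same mechanism in a YAML control file, while a manual gpfdist load looks roughly like this. You land the MongoDB extract as delimited files, serve them with gpfdist, and insert through an external table so all segments pull data in parallel.

# On the ETL host: serve the extracted files to the Greenplum segments
gpfdist -d /data/mongo_extract -p 8081 &

-- In Greenplum: define an external table over the files and bulk insert
CREATE EXTERNAL TABLE ext_orders (LIKE orders)
LOCATION ('gpfdist://etl-host:8081/orders*.csv')
FORMAT 'CSV';

INSERT INTO orders SELECT * FROM ext_orders;

This path bypasses row-by-row INSERTs through the master, which is usually where the ~130 rows/second bottleneck comes from.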

Greenplum loves batches.
a) You can modify the batch size in the transformation with the "Nr of rows in rowset" setting.
b) You can modify the commit size in the Table Output step.
I think a and b should match.
Find your optimum values. (For example, we use 1000 for rows with big JSON objects inside.)

Also, use the following connection property:
reWriteBatchedInserts=true
It rewrites the SQL from single-row inserts into batched inserts. It increased insert performance about ten times in my scenario.
https://jdbc.postgresql.org/documentation/94/connect.html
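For example (the host, port, and database name here are placeholders, not from the question), the property can be appended to the JDBC URL of the PDI database connection:

jdbc:postgresql://gp-master:5432/mydb?reWriteBatchedInserts=true

The property belongs to the PostgreSQL JDBC driver, which is typically what a Greenplum connection in PDI uses; it only helps when statements actually arrive as JDBC batches, which is why matching the rowset and commit sizes above matters.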
Thank you guys!

Related

Spark execution using JDBC

Suppose I fetch data from Oracle into a Spark DataFrame as below.
Will the query run completely in Oracle? Assume the query is huge; is that an overhead for Oracle? Would a better approach be to read each filtered table into a separate DataFrame and join them using Spark SQL or the DataFrame API, so that the complete join happens in Spark? Can you please help with this?
df = sqlContext.read.format('jdbc').options(
    url="jdbc:mysql://foo.com:1111",
    dbtable="(SELECT * FROM abc,bcd.... where abc.id= bcd.id.....) AS table1",
    user="test",
    password="******",
    driver="com.mysql.jdbc.Driver").load()
In general, actual data movement is the most time-consuming part and should be avoided. So, as a general rule, you want to filter as much as possible in the JDBC source (Oracle in your case) before the data are moved into your Spark environment.
Once you're ready to do some analysis in Spark, you can persist (cache) the result so as to avoid re-retrieving from Oracle every time.
That being said, #shrey-jakhmola is right, you want to benchmark for your particular circumstance. Is the Oracle environment choked somehow, perhaps?
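A minimal PySpark sketch of that idea (the table, column names, and connection details are made up for illustration): push the filter into the dbtable subquery so Oracle does the filtering, then cache the result so it is not re-fetched.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-pushdown").getOrCreate()

# The subquery runs inside Oracle; only the filtered rows cross the network.
pushdown_query = "(SELECT o.id, o.amount FROM orders o WHERE o.created_dt >= DATE '2018-01-01') t"

df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
      .option("dbtable", pushdown_query)
      .option("user", "test")
      .option("password", "******")
      .option("driver", "oracle.jdbc.OracleDriver")
      .load())

df.cache()   # keep the result in Spark so Oracle is not queried again
df.count()   # materializes the cache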

How to tune the cassandra.yaml file to get fast write performance?

When I tried to insert one million rows of data into a Cassandra database, even though I have high-performance hardware, it took one minute. Can anyone please help me with how to change Cassandra's default settings? I want to know exactly which parameters to modify in the cassandra.yaml or cassandra-env file to get the best performance and make full use of my server.
I tried using the CQL COPY command:
COPY table_name (field1, ..., fieldn) TO 'path_file'
About 9,000 rows per second; 56 million rows inserted in 54 minutes.

Nifi Fetching Data From Oracle Issue

I have a requirement to fetch data from Oracle and upload it into Google Cloud Storage.
I am using the ExecuteSQL processor, but it fails for large tables, and even for a table with 1 million records of approximately 45 MB it takes 2 hours to pull.
The table names are passed via a REST API to ListenHTTP, which passes them to ExecuteSQL. I can't use QueryDatabaseTable because the number of tables is dynamic, and the calls that start the fetch are also triggered dynamically through a UI and the NiFi REST API.
Please suggest any tuning parameters for the ExecuteSQL processor.
I believe you are talking about having the capability to have smaller flow files and possibly sending them downstream while the processor is still working on the (large) result set. For QueryDatabaseTable this was added in NiFi 1.6.0 (via NIFI-4836) and in an upcoming release (NiFi 1.8.0 via NIFI-1251) this capability will be available for ExecuteSQL as well.
You should be able to use GenerateTableFetch to do what you want. There you can set the Partition Size (which will end up being the number of rows per flow file) and you don't need a Maximum Value Column if you want to fetch the entire table each time a flow file comes in (which also allows you do handle multiple tables as you described). GenerateTableFetch will generate the SQL statements to fetch "pages" of data from the table, which should give you better, incremental performance on very large tables.
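For illustration only (the table, key column, and exact paging syntax depend on your database adapter and NiFi version), with a Partition Size of 10000 GenerateTableFetch emits one flow file per page, each carrying a statement roughly of the form:

SELECT * FROM my_table ORDER BY my_key OFFSET 0 ROWS FETCH NEXT 10000 ROWS ONLY
SELECT * FROM my_table ORDER BY my_key OFFSET 10000 ROWS FETCH NEXT 10000 ROWS ONLY

Each of these is then executed by a downstream ExecuteSQL, so individual result sets (and flow files) stay small instead of one huge pull.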

Apache Kafka consumer takes a long time when used as a plugin in Pentaho

(Screenshots attached: consumer transformation, producer transformation)
OBJECTIVE:
Transfer tables (120 tables) from an Oracle database to a Vertica database.
Current practice:
Use the Pentaho tool to extract data from the Oracle database, store it as files, and load the files into the Vertica database.
Problems faced:
The entire process runs for a long time.
Storing the data as files occupies more space and reduces performance.
New approach:
Use Kafka as a messaging system and its plugin in Pentaho.
Problem faced:
The consumer plugin takes a huge amount of time to consume messages and load them into the Vertica tables (6 times the time taken to load the messages with the producer).
1. Avro format
2. Sample of 2 million records with 200 columns
We would like to hear suggestions to improve this performance, or any other way to meet the objective using Kafka.
This document suggests using the Vertica Bulk Loader step directly after the Oracle Table Input step.

Increase efficiency of Sqoop export from HDFS

I am trying to export data using Sqoop from files stored in HDFS to Vertica. For around 10k rows, the files get loaded within a few minutes. But when I try to run tens of millions of rows (crores), it loads around 0.5% within 15 minutes or so. I have tried increasing the number of mappers, but that did not improve efficiency. Even setting the chunk size to increase the number of mappers does not increase the number.
Please help.
Thanks!
As you are using batch export, try increasing the records per transaction and records per statement using the following properties:
sqoop.export.records.per.statement: aggregates multiple rows into one single INSERT statement.
sqoop.export.records.per.transaction: how many INSERT statements will be issued per transaction.
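A hedged example of how those properties are passed (the connection string, credentials, table, and export directory below are placeholders): the -D options must come right after the tool name, before the Sqoop-specific arguments.

sqoop export \
  -D sqoop.export.records.per.statement=1000 \
  -D sqoop.export.records.per.transaction=1000 \
  --connect jdbc:vertica://vertica-host:5433/mydb \
  --driver com.vertica.jdbc.Driver \
  --username dbuser --password dbpass \
  --table target_table \
  --export-dir /user/hdfs/export_dir \
  --num-mappers 8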
I hope this solves the issue.
Most MPP/RDBMS systems have Sqoop connectors to exploit parallelism and increase the efficiency of data transfer between HDFS and the MPP/RDBMS. However, it seems Vertica has taken this approach: http://www.vertica.com/2012/07/05/teaching-the-elephant-new-tricks/
https://github.com/vertica/Vertica-Hadoop-Connector
Is this a "wide" dataset? It might be a Sqoop bug: https://issues.apache.org/jira/browse/SQOOP-2920. If the number of columns is very high (in the hundreds), Sqoop starts choking (very high CPU). When the number of fields is small, it's usually the other way around: Sqoop is bored and the RDBMS can't keep up.
