Tracking the status of vbr from vsql in Vertica

Is there any way I can track the status of a Vertica replication process run with vbr from vsql? It's required because only one instance of vbr can run at a time.

No, you can't: most of the work is done by rsync, and the database itself is barely involved (vsql only talks to the database).
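Since the progress is not visible from vsql, one workaround for the "only one vbr at a time" constraint is simply to check for a running vbr process on the node before starting another run. A minimal sketch (the process name to match, e.g. vbr.py or vbr, depends on your Vertica version):

# illustrative: check whether a vbr run is already in progress on this node
if pgrep -f "vbr" > /dev/null; then
    echo "vbr is already running; not starting another replication/backup"
else
    echo "no vbr process found; safe to start"
fi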

Related

How to increase the performance of inserting data from Mongo to Greenplum with PDI (Kettle)?

I use PDI (Kettle) to extract data from MongoDB into Greenplum. I tested extracting the data from MongoDB to a file, and it was fast, about 10,000 rows per second. But extracting into Greenplum runs at only about 130 rows per second.
I modified the following Greenplum parameters, but there was no significant improvement:
gpconfig -c log_statement -v none
gpconfig -c gp_enable_global_deadlock_detector -v on
And if I increase the number of Table output steps, the transformation seems to hang and no data gets inserted for a long time. I don't know why.
Thank you.
There are a variety of factors that could be at play here.
Is PDI loading via an ODBC or JDBC connection?
What is the size of data? (row count doesn't really tell us much)
What is the size of your Greenplum cluster (# of hosts and # of segments per host)?
Is the table you are loading into indexed?
What is the network connectivity between Mongo and Greenplum?
The best bulk-load performance with data integration tools such as PDI, Informatica PowerCenter, IBM DataStage, etc. will be achieved by using Greenplum's native bulk-loading utilities, gpfdist and gpload.
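As a rough sketch of that approach (host, port, file pattern, and table names are placeholders), you would start a gpfdist process next to the staging files and load through an external table:

-- on the ETL host (shell), serve the staging directory:
--   gpfdist -d /data/staging -p 8081 &

-- in Greenplum, read the files in parallel and insert into the target table
CREATE EXTERNAL TABLE customer_ext (LIKE customer)
LOCATION ('gpfdist://etl-host:8081/customer*.txt')
FORMAT 'TEXT' (DELIMITER '|');

INSERT INTO customer SELECT * FROM customer_ext;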
Greenplum loves batches.
a) You can modify the batch size in the transformation with "Nr rows in rowset".
b) You can modify the commit size in the Table output step.
I think a and b should match.
Find your optimal values. (For example, we use 1000 for rows with big JSON objects inside.)
Now, use the following connection property:
reWriteBatchedInserts=true
It rewrites the SQL from single-row inserts into batched inserts. It increased insert performance tenfold in my scenario.
https://jdbc.postgresql.org/documentation/94/connect.html
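In PDI this property simply goes on the JDBC URL of the database connection (a PostgreSQL-family driver is assumed; host, port, and database name are placeholders):

jdbc:postgresql://gp-master:5432/mydb?reWriteBatchedInserts=true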
Thank you guys!

clickhouse-client : very high memory usage

I have loaded the ontime dataset on a clickhouse-server running on WSL2. Everything is working fine on the server side, but suddenly clickhouse-client started taking a huge amount of memory, as evident in the htop output below.
It happened when just a simple GROUP BY query was executed:
select year,count(1) from datasets.ontime group by year
I had to shut down WSL to recover from this.
Please let me know if I am doing anything wrong!
Note: I changed the partitioning of the ontime dataset to Year and OriginState,
i.e. PARTITION BY (Year, OriginState)
ClickHouse version: 21.4.5.46 (official build)
[htop output with client + server]
ClickHouse was designed for servers with 64 GB+ of RAM, period.
clickhouse-client parses the query into an AST and rewrites it on the client side;
it also tries to fetch all metadata for auto-completion.
How many partitions do you have after the ALTER TABLE on ontime?
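One way to check (a hedged sketch; the database name datasets comes from the query above, and system.parts is a standard ClickHouse system table) is to count the active parts and distinct partitions:

-- count distinct partitions and active data parts for the ontime table
SELECT
    uniqExact(partition) AS partitions,
    count()              AS active_parts
FROM system.parts
WHERE database = 'datasets' AND table = 'ontime' AND active;

With PARTITION BY (Year, OriginState) you can easily end up with well over a thousand partitions, which inflates the metadata the server and client have to handle.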
Try cleaning the client history file: ~/.clickhouse-client-history
For some reason clickhouse-client loads the clickhouse-client-history file into memory, almost 5 GB of RAM in my case (ClickHouse client version 21.7.4.18).
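If that file has grown large, truncating it (optionally after backing it up) frees that memory the next time the client starts; a minimal sketch:

# optional: keep a copy of the old query history
cp ~/.clickhouse-client-history ~/.clickhouse-client-history.bak
# truncate the history file
: > ~/.clickhouse-client-history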

If we use 6 mappers in Sqoop to import data from Oracle, how many connections will be established between Sqoop and the source?

If we use 6 mappers in Sqoop to import data from Oracle, how many connections will be established between Sqoop and the source?
Will it be a single connection, or will there be one connection for each of the 6 mappers?
As per the Sqoop docs:
Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.
That means all the mappers will make concurrent connections.
Also keep in mind that if your table has only 2 records, Sqoop will use only 2 mappers, not all 6.
Check my other answer to understand the concept of the number of mappers in a Sqoop command.
EDIT:
All the mappers will open connections as JDBC client programs, but many of them stay inactive. The active connections (which actually fire the SQL queries) are shared among multiple mappers.
Run the Sqoop import command in --verbose mode and you will see logs like:
DEBUG manager.OracleManager$ConnCache: Got cached connection for jdbc:oracle:thin:#192.xx.xx.xx:1521:orcl/dev
DEBUG manager.OracleManager$ConnCache: Caching released connection for jdbc:oracle:thin:#192.xx.xx.xx:1521:orcl/dev
Check getConnection and recycle methods for more details.
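For reference, an import command of the kind discussed here might look roughly like this (the connection string, credentials, table name, and paths are placeholders):

# illustrative only: host, credentials and table are placeholders
sqoop import \
  --connect jdbc:oracle:thin:@db-host:1521:orcl \
  --username scott \
  --password-file /user/scott/.oracle.pwd \
  --table CDR_DATA \
  --num-mappers 6 \
  --verbose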
Each map task will get a DB connection, so in your case 6 maps means 6 connections. Please visit the Sqoop source on GitHub to see how it is implemented.
-m specifies the number of mapper tasks that will run as part of the job.
So the more mappers, the more connections.
It probably depends on the manager, but I guess all of them are likely to create one each. Take DirectPostgresSqlManager: it creates one connection per mapper through psql COPY TO STDOUT.
Please take a look at the managers at:
Sqoop Managers

When should I use Greenplum Database versus HAWQ?

We have a use case for retail-industry data. We are building an EDW.
We are currently doing reporting from HAWQ, but we want to shift our MPP database from HAWQ to Greenplum.
Basically, we would like to make changes to our current data pipeline.
Our points of confusion about GPDB:
How is the GPDB layer going to affect our existing data pipeline? The current pipeline is external system --> Talend --> Hadoop/HAWQ --> Tableau. We want to transform it into external system --> Talend --> Hadoop/HAWQ --> Greenplum --> Tableau.
How is Greenplum physically or logically going to help with SQL transformation and reporting?
Which file format should I opt for when storing files in GPDB? In HAWQ we store files in plain-text format. Which of the supported formats (Avro, Parquet, etc.) are good for writing in GPDB?
How are data files processed in GPDB so that it also delivers faster reporting and predictive analysis?
Is there any way to push data from HAWQ into Greenplum? We are looking for guidance on how to shift our reporting use case from HAWQ to Greenplum.
Any help would be much appreciated.
This question is sort of like asking, "when should I use a wrench?" The answer is also going to be subjective, as Greenplum can be used for many different things. But I will do my best to give my opinion, because you asked.
How is the GPDB layer going to affect our existing data pipeline? The current pipeline is external system --> Talend --> Hadoop/HAWQ --> Tableau. We want to transform it into external system --> Talend --> Hadoop/HAWQ --> Greenplum --> Tableau.
There are lots of ways to build this data pipeline; your approach of loading data into Hadoop first and then loading it into Greenplum is very common and works well. You can use external tables in Greenplum to read data in parallel, directly from HDFS, so the data movement from the Hadoop cluster to Greenplum can be achieved with a simple INSERT statement.
INSERT INTO greenplum_customer SELECT * FROM hdfs_customer_file;
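For example, the hdfs_customer_file table referenced above could be defined as a readable external table over HDFS, roughly like this (the gphdfs protocol shown here is version-dependent and newer Greenplum releases use PXF instead; the column list, host, and path are placeholders):

-- illustrative only: columns, host and path are placeholders
CREATE EXTERNAL TABLE hdfs_customer_file (
    customer_id bigint,
    name        text,
    state       text
)
LOCATION ('gphdfs://namenode:8020/data/customer')
FORMAT 'TEXT' (DELIMITER '|');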
How is Greenplum physically or logically going to help with SQL transformation and reporting?
Isolation for one. With a separate cluster for Greenplum, you can provide analytics to your customers without impacting the performance of your Hadoop activity and vice-versa. This isolation also can provide an additional security layer.
Which file format should I opt for when storing files in GPDB? In HAWQ we store files in plain-text format. Which of the supported formats (Avro, Parquet, etc.) are good for writing in GPDB?
With the data pipeline you suggested, I would make the data-format decision in Greenplum based on performance. For large tables, partition them and make them column-oriented with quicklz compression. For smaller tables, just make them append-optimized. And for tables that have lots of updates or deletes, keep the default heap storage.
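As a sketch of those storage choices (the table name, columns, and partition ranges are made up for illustration):

-- illustrative only: a large, column-oriented, compressed, partitioned fact table
CREATE TABLE sales_fact (
    sale_id   bigint,
    sale_date date,
    amount    numeric
)
WITH (appendonly=true, orientation=column, compresstype=quicklz)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
    (START (date '2016-01-01') INCLUSIVE
     END   (date '2017-01-01') EXCLUSIVE
     EVERY (INTERVAL '1 month'));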
How are data files processed in GPDB so that it also delivers faster reporting and predictive analysis?
Greenplum is an MPP database. The storage is "shared nothing" meaning that each node has unique data that no other node has (excluding mirroring for high-availability). A segment's data will always be on the local disk.
In HAWQ, because it uses HDFS, the data for the segment doesn't have to be local. Day 1, when you wrote the data to HDFS, it was local but after failed nodes, expansion, etc, HAWQ may have to fetch the data from other nodes. This makes Greenplum's performance a bit more predictable than HAWQ because of how Hadoop works.
Is there any way to push data from HAWQ into Greenplum? We are looking for guidance on how to shift our reporting use case from HAWQ to Greenplum.
Push, no; but pull, yes. As I mentioned above, you can create an external table in Greenplum to SELECT data from HDFS. You can also create writable external tables in Greenplum to push data to HDFS.
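A rough sketch of the push direction, reusing the hypothetical gphdfs location style from above (table names and paths are placeholders):

-- writable external table for pushing data back to HDFS (illustrative)
CREATE WRITABLE EXTERNAL TABLE hdfs_customer_out (LIKE greenplum_customer)
LOCATION ('gphdfs://namenode:8020/export/customer')
FORMAT 'TEXT' (DELIMITER '|');

INSERT INTO hdfs_customer_out SELECT * FROM greenplum_customer;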

Will Hadoop (Sqoop) load Oracle faster than SQL*Loader?

We presently load CDRs into an Oracle warehouse using a combination of bash shell scripts and SQL*Loader with multiple threads. We are hoping to offload this process to Hadoop because we envisage that the growth in data due to a growing subscriber base will soon max out the current system. We also want to gradually introduce Hadoop into our data warehouse environment.
Will loading from Hadoop be faster?
If so, what is the best set of Hadoop tools for this?
Further info:
We usually get a continuous stream of pipe-delimited text files through FTP into a folder, add two more fields to each record, load them into temp tables in Oracle, and run a procedure to load the final table. How would you advise the process flow in terms of the tools to use? For example:
files are FTPed to the Linux file system (or is it possible to FTP straight into Hadoop?) and Flume loads them into Hadoop;
fields are added (what would be best for this? Pig, Hive, Spark, or any other recommendations);
files are then loaded into Oracle using Sqoop;
the final procedure is called (can Sqoop make an Oracle procedure call? If not, what tool would be best to execute the procedure and help control the whole process?).
Also, how can one control the level of parallelism? Does it equate to the number of mappers running the job?
I had a similar task of exporting data from a < 6-node Hadoop cluster to an Oracle data warehouse.
I've tested the following:
Sqoop
OraOop
Oracle Loader for Hadoop from the "Oracle BigData Connectors" suite
A Hadoop streaming job which uses SQL*Loader (sqlldr) as the mapper; in its control file you can read from stdin using: load data infile "-"
Considering speed alone, the Hadoop streaming job with SQL*Loader as the mapper was the fastest way to transfer the data, but you have to install SQL*Loader on each machine of your cluster. It was more of a personal curiosity; I would not recommend using this approach to export data, since the logging capabilities are limited and it is likely to have a bigger impact on your data warehouse's performance.
The winner was Sqoop: it is pretty reliable, it is the import/export tool of the Hadoop ecosystem, and it was the second-fastest solution according to my tests (1.5x slower than first place).
Sqoop with OraOop (last updated in 2012) was slower than the latest version of Sqoop, and requires extra configuration on the cluster.
Finally, the worst time was obtained using Oracle's Big Data Connectors; if you have a big cluster (>100 machines), it should not be as bad as the time I obtained. The export was done in two steps. The first step involves reprocessing the output and converting it to an Oracle format that plays nicely with the data warehouse. The second step was transferring the result to the data warehouse. This approach is better if you have a lot of processing power, and it will not impact the data warehouse's performance as much as the other solutions.
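For the Sqoop route, a plain export of pipe-delimited files from HDFS into an Oracle staging table might look roughly like this (host, credentials, table name, HDFS path, and mapper count are placeholders; the final stored procedure would still be invoked separately, e.g. from the orchestrating script):

# illustrative only: connection details and paths are placeholders
sqoop export \
  --connect jdbc:oracle:thin:@dwh-host:1521:orcl \
  --username loader \
  --password-file /user/loader/.oracle.pwd \
  --table CDR_STAGE \
  --export-dir /data/cdrs/2016-06-01 \
  --input-fields-terminated-by '|' \
  -m 8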
