My Greenplum insert is slow

I am bulk-inserting data from Oracle into Greenplum through JDBC. The data is plain text, and the insert speed is very slow, about 200 rows per second. Is there any good solution?
When I insert data from Oracle into HDFS with the same configuration, I get about 20,000 rows per second.
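The usual culprit is that a plain JDBC INSERT pays one network round trip per row, while the HDFS path writes whole files. Greenplum is PostgreSQL-based, so COPY (or gpfdist/gpload for parallel loads into the segments) is typically the faster route. Below is a minimal sketch of the COPY approach, using Python and psycopg2 instead of JDBC; the connection details, table name and file path are placeholders, not taken from the question.

    # Minimal sketch: bulk-load a plain-text extract into Greenplum via COPY
    # instead of row-by-row INSERTs. All names/credentials are placeholders.
    import psycopg2

    conn = psycopg2.connect(
        host="gp-master.example.com",  # Greenplum master host (placeholder)
        port=5432,
        dbname="analytics",
        user="loader",
        password="secret",
    )

    with conn, conn.cursor() as cur, open("oracle_extract.csv") as f:
        # COPY streams the whole file in one statement, avoiding a round
        # trip per row, which is what makes plain INSERTs so slow here.
        cur.copy_expert("COPY target_table FROM STDIN WITH CSV", f)
    conn.close()

For sustained high-volume loads, the gpfdist external-table path is usually faster still, because every Greenplum segment ingests data in parallel instead of funnelling everything through the master.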

Related

Oracle Blob object ingestion into Snowflake

So I'm aware that Snowflake doesn't really have an Oracle Blob equivalent, but I'm just curious how others out there are addressing the need for having Blob data from Oracle in their data warehouse. Specifically, where the general 16 MB limit on VARCHAR and the 8 MB limit on BINARY is not enough.
These are some examples I have come across for "specifically where the general 16 MB limit on VARCHAR and 8 MB limit on BINARY is not enough":
Storing more than 16 MB of data in Snowflake - VARIANT holds 16 MB of compressed data
How To Load Data Into Snowflake – Snowflake Data Load Best Practices
- Using a Snowflake stage is a great way to plan the upload
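One pattern for payloads that do not fit in VARIANT/BINARY columns is to keep the Blob content as files in a Snowflake stage and store only metadata plus the stage path in a table. A rough sketch with the Python connector; the stage, table and file names are invented for illustration:

    # Rough sketch: park large Oracle BLOB exports as files in an internal
    # Snowflake stage and keep only a reference in a regular table.
    # Account, stage and table names below are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",
        user="loader",
        password="secret",
        warehouse="LOAD_WH",
        database="RAW",
        schema="ORACLE_BLOBS",
    )
    cur = conn.cursor()

    # One-time setup: an internal stage to hold the raw blob files.
    cur.execute("CREATE STAGE IF NOT EXISTS blob_stage")

    # Upload a blob that was exported from Oracle to a local file.
    cur.execute("PUT file:///tmp/doc_12345.bin @blob_stage AUTO_COMPRESS=TRUE")

    # Record where the payload lives, alongside whatever metadata is needed.
    cur.execute(
        "INSERT INTO blob_registry (doc_id, stage_path) VALUES (%s, %s)",
        (12345, "@blob_stage/doc_12345.bin.gz"),
    )
    cur.close()
    conn.close()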

Storing data in HBase vs Parquet files

I am new to big data and am trying to understand the various ways of persisting and retrieving data.
I understand that both Parquet and HBase are column-oriented storage formats, but Parquet is file-oriented storage and not a database, unlike HBase.
My questions are:
What is the use case for using Parquet instead of HBase?
Is there a use case where Parquet can be used together with HBase?
In the case of performing joins, will Parquet perform better than HBase (say, accessed through a SQL skin like Phoenix)?
As you have already said in the question, Parquet is just storage, while HBase is storage (HDFS) + a query engine (API/shell). So a fair comparison should be made between Parquet + Impala/Hive/Spark and HBase. Below are the key differences:
1) Disk space - Parquet takes less disk space than HBase. Parquet encoding saves more space than block compression in HBase.
2) Data ingestion - Data ingestion into Parquet is more efficient than into HBase. A simple reason is point 1: in the case of Parquet, less data needs to be written to disk.
3) Record lookup by key - HBase is faster here, as it is a key-value store while Parquet is not. Indexing in Parquet will be supported in a future release.
4) Filter and other scan queries - Since Parquet stores statistics about the records in each row group, it can skip a lot of records while scanning the data. This is why it is faster than HBase for such queries (see the sketch after this answer).
5) Updating records - HBase supports record updates, while this is problematic in Parquet because the Parquet files need to be re-written. Careful design of the schema and partitioning may improve updates, but it is not comparable with HBase.
Comparing the features above, HBase seems more suitable for situations where updates are required and queries involve mainly key-value lookups. Queries involving key-range scans will also perform better in HBase.
Parquet is suitable for use cases where updates are very few and queries involve filters, joins, and aggregations.
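As a concrete illustration of point 4 (not part of the original answer): each Parquet row group carries min/max statistics per column, so a reader given a filter can skip whole row groups without decoding them. A small pyarrow sketch with invented column names:

    # Small sketch of Parquet row-group skipping via filters (pyarrow).
    # Column names and file paths are illustrative only.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write a table split into several row groups; each row group stores
    # min/max statistics for every column.
    table = pa.table({
        "user_id": list(range(1_000_000)),
        "amount": [i % 100 for i in range(1_000_000)],
    })
    pq.write_table(table, "events.parquet", row_group_size=100_000)

    # With a filter, row groups whose statistics cannot match the predicate
    # are skipped instead of being read and decoded.
    subset = pq.read_table("events.parquet", filters=[("user_id", ">", 950_000)])
    print(subset.num_rows)  # only the matching rows are materialized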

Why do joins in HIVE take significant time to execute?

I am trying to join two tables in Hive that have almost the same number of records. The query is taking a long time to execute.
Why do JOINs in Hive take a long time to execute?
The number of records is approx 50k in both tables.
A Hive query is converted internally to MapReduce and then executed, which is why it takes a few minutes to run. There are different ways you can improve the performance; you can follow this link to improve your query performance.
The main reason for using Hive or Hadoop is handling huge volumes of data, so you will definitely see a huge performance gain compared to other relational databases when you are handling huge data. But the amount of data you are mentioning is probably not a good use case for Hive.
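One concrete knob behind the "different ways to improve performance" mentioned above is letting Hive convert a join against a small table into a map-side join, which skips the shuffle. This is a hedged sketch only, assuming a HiveServer2 endpoint and the PyHive client; host, port and table names are made up:

    # Hedged sketch: enable map-side join conversion for small tables in Hive.
    # Host/port/credentials/table names are placeholders.
    from pyhive import hive

    conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="etl")
    cur = conn.cursor()

    # Broadcast the smaller table to the mappers when it is below the size
    # threshold (bytes), instead of running a reduce-side join.
    cur.execute("SET hive.auto.convert.join=true")
    cur.execute("SET hive.mapjoin.smalltable.filesize=50000000")

    cur.execute("""
        SELECT a.id, a.val, b.val
        FROM table_a a
        JOIN table_b b ON a.id = b.id
    """)
    rows = cur.fetchall()
    print(len(rows))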

Advantage of creating Hive partitions when using parquet file storage

Is there any advantage to creating Hive partitions when using Parquet file storage? Parquet is a columnar storage file format that stores data in column chunks, with all the columns stored sequentially by index. When we select a column based on a predicate, the reader jumps to the required range for that column and returns the values. How will partitioning be helpful? In row-oriented Hive tables, partitioning is helpful because we hit only the specified, required range of data, but I'm not able to understand how it will be helpful with Parquet storage.
In non-partitioned tables, Hive would have to read all the files in the table's data directory and then apply filters on them. For a large table this is slow and expensive.
In partitioned tables, Hive creates subdirectories based on the partition column. This distributes the execution load horizontally, and there is no need to search the entire table for a single record.
The Parquet file format has better compression, but on its own the query performance is not that good.
Partitioning combined with Parquet reduces query execution time. For example, when I executed a filter query on a plain Parquet table it took 29.657 seconds, whereas the same query on a partitioned Parquet table took 14.21 seconds. For a large table it will definitely improve query performance.
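To make the directory layout concrete (this sketch is mine, not from the answer): partitioning writes one subdirectory per partition value, so a query filtering on that column never opens the other directories at all, and Parquet's row-group statistics then prune further inside the files that are read. Column names are invented:

    # Illustrative sketch: write Parquet with a Hive-style partitioned layout
    # and read it back with a partition filter. Column names are made up.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "country": ["us", "us", "us", "de", "de", "de"],
        "user_id": [1, 2, 3, 4, 5, 6],
        "amount": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    })

    # Produces events/country=us/... and events/country=de/... subdirectories.
    pq.write_to_dataset(table, root_path="events", partition_cols=["country"])

    # A filter on the partition column only touches the matching subdirectory;
    # within those files, row-group statistics can skip further data.
    subset = pq.read_table("events", filters=[("country", "=", "de")])
    print(subset.num_rows)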

Best way to read a large CSV (10 GB) and, after computation, store the data in a DB

I have a ~4 GB text file which I parse, saving the data in a DB. This process takes almost 3-4 hours (5-6 million lines) to parse and save, and it runs every day.
Now when I query the DB it takes too much time to compute the result and return it. For example, a simple AVG or SUM operation for a particular day takes 30-40 minutes.
I am using Python and MySQL right now. I also tried Spark for this computation, which likewise takes 30-40 minutes, and the data is growing, so the file size will increase to around 10 GB, and Spark is not able to handle such large files for me.
Please suggest how I can improve the parsing time, the time to store in the DB, and the fetching time.
I do not know what database you are using, but maybe you could switch?
I suggest using Impala + an AVRO schema. You will probably need to refresh/create the table using Hive, as Impala lacks some functionality in the administrative area.
I've used it with files stored on HDFS, and grouping and then summing 45 GB of floats took me about 40 seconds on 4 machines. You spend no time loading anything into a database, as the source is the files themselves. All the time you need is the time to store the files in HDFS, and that's as fast as any file system.
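Very roughly, the flow described above looks like this (a sketch under my own assumptions: the impyla client, an existing events table backed by an HDFS directory, and invented paths and names):

    # Sketch of the files-on-HDFS + Impala approach: no per-row database load,
    # just drop the daily file into the table's directory and query in place.
    # Paths, hosts and the table name are assumptions, not a recipe.
    import subprocess
    from impala.dbapi import connect

    # 1) Put the daily file straight into the table's HDFS directory.
    subprocess.run(
        ["hdfs", "dfs", "-put", "/data/events_2023-01-02.avro", "/warehouse/events/"],
        check=True,
    )

    # 2) Tell Impala about the new file and run the aggregation there.
    conn = connect(host="impala-coordinator.example.com", port=21050)
    cur = conn.cursor()
    cur.execute("REFRESH events")  # pick up files added outside of Impala
    cur.execute("""
        SELECT AVG(amount), SUM(amount)
        FROM events
        WHERE event_date = '2023-01-02'
    """)
    print(cur.fetchall())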
