While inserting records using batch insert ( https://tool.oschina.net/uploads/apidocs/Spring-3.1.1/org/springframework/jdbc/core/simple/SimpleJdbcInsert.html#executeBatch(org.springframework.jdbc.core.namedparam.SqlParameterSource[]) ) into a Redshift table, the Spring framework falls back to one-by-one insertion, which takes much longer:
(main) org.springframework.jdbc.support.JdbcUtils: JDBC driver does not support batch updates
Is there any way to enable batch updates for a Redshift table?
If not, is there any way to improve insert performance in Redshift?
I tried adding ?rewriteBatchedStatements=true to the JDBC URL; the result is the same.
The recommended way of doing a batch insert is to use the COPY command. The common process is to unload data from Redshift to S3 using the UNLOAD command (in case the data you want to insert comes from a query result), and then to run a COPY command referencing the data location in S3. This is far more efficient than row-by-row inserts.
UNLOAD ('my SQL statement')
TO 's3://my-s3-target-location'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<my-redshift-role>'
FORMAT PARQUET;
COPY my_target_table (col1, col2, ...)
FROM 's3://my-s3-target-location'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<my-redshift-role>'
FORMAT PARQUET;
Here is the documentation:
https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
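If the data you want to insert originates in your application rather than in a query result, and COPY is not practical, a multi-row INSERT is still much faster than single-row inserts, because Redshift pays a large per-statement overhead. A minimal sketch (table and column names are placeholders):
INSERT INTO my_target_table (col1, col2)
VALUES
  ('a1', 'b1'),
  ('a2', 'b2'),
  ('a3', 'b3');
You can build one such statement per batch of a few hundred rows and execute it with a plain JdbcTemplate.update call.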
Related
I'm looking at a Spring Boot application that is used to copy data from a temp table to a permanent table based on the last-updated date. It copies a record only if its last-updated date is greater than the desired date, so not all records are copied over. The table currently has around 300K+ records, and the process with Spring JPA takes over 2 hours (for all of them), which is not at all feasible. The goal is to bring it down to 15 minutes at most. I'm trying to see how much difference using JdbcTemplate would make. Would a PL/SQL script be a better option? I also wanted to see if there are better options out there. Appreciate your time.
We are using an Oracle database at the moment, but a PostgreSQL migration is on the cards.
Thanks!
You can do this operation with a straight SQL query (which will work on Oracle or PostgreSQL). Assuming your temp_table has the same columns as the permanent table, that the last-updated-date column is called last_updated, and that you want to copy all records updated since 2020-05-03, you could write a query like:
INSERT INTO perm_table
SELECT *
FROM temp_table
WHERE last_updated > TO_DATE('2020-05-03', 'YYYY-MM-DD')
In your app you would pass '2020-05-03' via a placeholder either directly or via JdbcTemplate.
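For instance, a minimal sketch with a named parameter (the name :cutoff_date is illustrative; with plain JdbcTemplate you would use a ? placeholder instead):
INSERT INTO perm_table
SELECT *
FROM temp_table
WHERE last_updated > :cutoff_date  -- bound to DATE '2020-05-03' at execution time
A single set-based statement like this lets the database do the copy in one pass, which is where the speedup over row-by-row JPA comes from.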
I am new to Hadoop and Hive, and I am confused about Hive's INSERT INTO and LOAD DATA statements.
When I execute INSERT INTO TABLE_NAME (field1, field2) VALUES (value1, value2);, HiveServer runs a MapReduce job.
When I execute LOAD DATA LOCAL INPATH PATH_TO_MY_DATA INTO TABLE TABLE_NAME;, it only loads the data from the file and does nothing else.
I wrote a program in Python. Here is my problem: when I use pyhs2 and run an INSERT statement per record, each record triggers a MapReduce job, which is very slow.
Should I first save my data somewhere, and later use the load data statement to load it?
Load
Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.
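For example (path and table name are illustrative), this just moves the file into the table's warehouse directory without starting a MapReduce job:
LOAD DATA LOCAL INPATH '/tmp/my_data.txt' INTO TABLE my_table;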
Insert
Query Results can be inserted into tables by using the insert clause.
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
With LOAD, all the data in the file is copied into the table; with INSERT, you can insert data based on some condition.
Your solution
You execute the given HQL for every single row, so MapReduce runs every time. If you want to execute your inserts in a single MapReduce job, create a single multi-row query:
INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
If you have more records, you can split them into batches of such statements.
INSERT OVERWRITE will overwrite any existing data in the table or partition, while INSERT INTO will append to the table or partition, keeping the existing data (ref at apache.org).
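A quick illustration (table, partition, and staging names are made up):
-- replaces whatever is currently in the dt='2020-05-03' partition
INSERT OVERWRITE TABLE sales PARTITION (dt='2020-05-03')
SELECT * FROM staging_sales;
-- appends to the partition, keeping its existing data
INSERT INTO TABLE sales PARTITION (dt='2020-05-03')
SELECT * FROM staging_sales;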
Hive 0.13 takes a SHARED lock on the entire database (I see a node like LOCK-0000000000 as a child of the database node in ZooKeeper) when running a SELECT statement on any table in the database. Because Hive creates a shared lock on the entire schema even for a SELECT statement, CREATE/DELETE statements on other tables in the database freeze until the original query finishes and the lock is released.
Does anybody know a way around this? The following link suggests turning concurrency off, but we can't do that: we are replacing the entire table, and we have to make sure that no SELECT statement is accessing the table while we replace its contents.
http://mail-archives.apache.org/mod_mbox/hive-user/201408.mbox/%3C0eba01cfc035$3501e4f0$9f05aed0$#com%3E
use mydatabase;
select count(*) from large_table limit 1; -- this table is very large, and hive.support.concurrency=true
In another Hive shell, while the first query is executing:
use mydatabase;
create table sometable (id string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE ;
The problem is that the CREATE TABLE does not execute until the first query (the SELECT) has finished.
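For reference, the lock can be observed from a third shell while the SELECT runs (Hive 0.13 supports SHOW LOCKS at the database level):
use mydatabase;
SHOW LOCKS;                      -- lists current locks
SHOW LOCKS DATABASE mydatabase;  -- shows the database-level SHARED lock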
Update:
We are using Cloudera's distribution of Hive CDH-5.2.1-1 and we are seeing this issue.
I don't think Hive 0.13 was ever meant to behave that way. Please check your ResourceManager and verify that you have enough memory when you are executing multiple Hive queries.
As you know, each Hive query triggers a MapReduce job, and if YARN doesn't have enough resources, it will wait until the previously running job completes. Please approach your issue from a memory point of view.
All the best !!
I am evaluating the combination of Hadoop and Hive (and Impala) as a replacement for a large data warehouse. I have already set up a test installation, and read performance is great.
Can somebody give me a hint about what approach to use for daily data deliveries to a table?
I have a table in Hive based on a file I put into HDFS. But now new transactional data comes in on a daily basis.
How do I add it to the table in Hive?
Inserts are not possible, and HDFS cannot append. So what's the general concept I need to follow?
Any advice or direction to documentation is appreciated.
Best regards!
Hive allows data to be appended to a table - the underlying implementation of how this happens in HDFS doesn't matter. There are a number of things you can do to append data:
INSERT - You can just append rows to an existing table.
INSERT OVERWRITE - If you have to process data, you can perform an INSERT OVERWRITE to re-write a table or partition.
LOAD DATA - You can use this to bulk insert data into a table and, optionally, use the OVERWRITE keyword to wipe out any existing data.
Partition your data.
Load data into a new table and swap the partition in
Partitioning is great if you know you're going to be performing date-based searches, and it gives you the ability to use options 1, 2, and 3 at either the table or partition level.
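A minimal sketch of the partitioning approach (table, column, and path names are assumptions): create the table partitioned by delivery date and load each daily file into its own partition:
CREATE TABLE transactions (id STRING, amount DOUBLE)
PARTITIONED BY (delivery_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA INPATH '/incoming/2020-05-03/'
INTO TABLE transactions PARTITION (delivery_date = '2020-05-03');
Each daily delivery then lands in its own partition, and an individual partition can later be overwritten or swapped without touching the rest of the table.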
Inserts are not possible
Inserts are possible: for example, you can create a new table and insert the data from the new table into the old table.
But the simpler solution is to load the file's data into the Hive table with the command below.
load data inpath '/filepath' [overwrite] into table tablename;
If you use overwrite, the existing data is replaced with the new data; otherwise, the new data is appended.
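For example (path and table name are illustrative):
load data inpath '/data/2020-05-03/' into table transactions;            -- appends
load data inpath '/data/2020-05-03/' overwrite into table transactions;  -- replaces existing data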
You can even schedule it by putting the command in a shell script.
I am going to create a lot of data scripts, such as INSERT INTO and UPDATE statements.
There will be 100,000-plus records, if not 1,000,000.
What is the best way to get this data into Oracle quickly? I have already found that SQL*Loader is not good for this, as it does not update individual rows.
Thanks
UPDATE: I will be writing an application to do this in C#
Load the records into a staging table via SQL*Loader. Then use bulk operations:
INSERT INTO ... SELECT (for example "Bulk Insert into Oracle database")
a mass UPDATE ("Oracle - Update statement with inner join")
or a single MERGE statement
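A sketch of the MERGE variant, assuming a staging table stage_tab keyed on id with the same columns as the target (all names here are placeholders):
MERGE INTO target_tab t
USING stage_tab s
ON (t.id = s.id)
WHEN MATCHED THEN
  UPDATE SET t.col1 = s.col1, t.col2 = s.col2
WHEN NOT MATCHED THEN
  INSERT (id, col1, col2)
  VALUES (s.id, s.col1, s.col2);
One MERGE pass handles both the inserts and the updates, so the database can use set-based operations instead of row-by-row processing.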
To keep it as fast as possible, I would keep it all in the database:
use external tables (to allow Oracle to read the file contents),
and create a stored procedure to do the processing.
The update could be slow. If possible, it may be a good idea to consider creating a new table based on all the records in the old one (with the updates applied) and then switching the new and old tables around.
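A minimal external-table sketch (the directory object, file name, and columns are assumptions); once created, the file can be queried or MERGEd like any other table:
CREATE TABLE staged_data_ext (
  id   NUMBER,
  name VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir  -- an Oracle DIRECTORY object pointing at the file's location
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('data.csv')
);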
How about using a spreadsheet program like MS Excel or LibreOffice Calc? This is how I perform bulk inserts.
Prepare your data in a tabular format.
Let's say you have three columns, A (text), B (number) & C (date). In the D column, enter the following formula. Adjust accordingly.
="INSERT INTO YOUR_TABLE (COL_A, COL_B, COL_C) VALUES ('"&A1&"', "&B1&", to_date ('"&C1&"', 'mm/dd/yy'));"