Setting Physical options for existing tables - ddl

How do I set the following physical options for existing Greenplum tables using an SQL script:
with(
appendonly = true,
orientation = column,
compresstype = zstd,
compresslevel = 1
) distributed replicated

Currently, Greenplum doesn't support altering storage types/characteristics for existing tables; you would have to recreate the table. This enhancement is tracked via https://github.com/greenplum-db/gpdb/issues/5300, so please feel free to provide insight into your use case on that GitHub issue.
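A rough sketch of that recreate-and-swap approach (foo is a placeholder table name; grants, indexes, and dependent views have to be handled separately):
-- Recreate "foo" as an append-optimized, column-oriented, zstd-compressed, replicated table.
CREATE TABLE foo_new (LIKE foo)
WITH (
    appendonly = true,
    orientation = column,
    compresstype = zstd,
    compresslevel = 1
)
DISTRIBUTED REPLICATED;

INSERT INTO foo_new SELECT * FROM foo;

-- Swap the tables once the copy is verified.
ALTER TABLE foo RENAME TO foo_old;
ALTER TABLE foo_new RENAME TO foo;
-- DROP TABLE foo_old;  -- when you are satisfied with the result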
The distribution of the table, though, can be changed via the ALTER TABLE command, for example:
-- hash distributed table
CREATE TABLE foo (a int, b int) distributed by (a);
-- alter to convert it to replicated
ALTER TABLE foo set distributed replicated;

Related

How to create an AWS Athena Table with Partition Projection and Bucketing enabled?

I am trying to create an Athena Table that makes use of both Projected Partitioning and Bucketing (CLUSTERED BY). I'm doing this to get a side-by-side performance comparison for our dataset with and without Bucketing. Through my tests, this does not seem to be supported, but I cannot find anything in the documentation that explicitly states this, so I'm assuming that I'm missing something. Bucketing works with normal Partitioning, but I'm trying to make use of Projected Partitioning so that I do not have to maintain the Partitions in the Glue Catalog.
This my setup. I have an existing Athena Table that is setup to read Gzipped Parquet files on S3. This all works. In order to create the Bucketed version of my Table(s), I'm using Athena CTAS to create Bucketed Gzipped Parquet Files. The CTAS files are written to a staging location and then I move them to a location that fits my storage structure. I then try to create a new Table that points to the bucketed data and try to enable Projected Partitioning and Bucketing in the Table setup. I've used both Athena SQL and AWS Wrangler's create_parquet_table to do this.
Here is the original CTAS SQL that creates the Bucketed Files:
CREATE TABLE database_name.ctas_table_name
WITH (
external_location = 's3://bucket/staging_prefix',
partitioned_by = ARRAY['partition_column'],
bucketed_by = ARRAY['index_column'],
bucket_count = 10,
format = 'PARQUET'
)
AS
SELECT
index_column,
partition_column
FROM database_name.table_name;
The files produced from the above CTAS are then moved from the staging location to the actual location, let's call it 's3://bucket/table_prefix'. This results in a s3 structure like:
s3://bucket/table_prefix/partition_column=xx/file.bucket_000.gzip.parquet
s3://bucket/table_prefix/partition_column=xx/file.bucket_001.gzip.parquet
...
s3://bucket/table_prefix/partition_column=xx/file.bucket_009.gzip.parquet
So 10 bucketed files per partition.
Then the SQL to create the Table on Athena
CREATE TABLE database_name.table_name (
index_column bigint,
partition_column bigint
)
CLUSTERED BY (index_column) INTO 10 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://bucket/table_prefix'
TBLPROPERTIES (
'classification'='parquet',
'compressionType'='gzip',
'projection.enabled'='true',
'projection.index_column.type'='integer',
'projection.index_column.range'='0,3650',
'typeOfData'='file'
);
If I submit this last CREATE TABLE SQL, it succeeds. However, when selecting from the table, I get the following error message:
HIVE_INVALID_BUCKET_FILES: Hive table 'database_name.table_name' is corrupt. Found sub-directory in bucket directory for partition: <UNPARTITIONED>
If I try to create the Table using the aforementioned awswrangler.catalog.create_parquet_table, which looks like this:
response = awswrangler.catalog.create_parquet_table(
boto3_session = boto3_session,
database = 'database_name',
table = 'table_name',
path = 's3://bucket/table_prefix',
partitions_types = {"partition_column": "bigint"},
columns_types = {"index_column": "bigint", "partition_column": "bigint"},
bucketing_info = (["index_column"], 10),
compression = 'gzip',
projection_enabled = True,
projection_types = {"partition_column": "integer"},
projection_ranges = {"partition_column": "0,3650"},
)
This API call raises the following exception:
awswrangler.exceptions.InvalidArgumentCombination: Column index_column appears as projected column but not as partitioned column.
This does not make sense, as it clearly is there. I believe this to be a red herring in any case. If I remove the <bucketing_info> parameter, it all works. Conversely, if I remove the <projection...> parameters, it all works.
So from what I can gather, Partition Projection is not compatible with Bucketing. But this is not made clear in the documentation, nor could I find anything online to support this. So I'm asking here if anyone knows what is going on?
Did I make a mistake in my setup?
Did I miss a piece of AWS Athena documentation that states this is not possible?
Or is this an undocumented incompatibility?
Or... aliens??

How to use external table in hive?

Can anyone please explain why and where do we use external tables in hive?
Please explain a scenario to understand easily.
We use an external table when the underlying dataset pointed to by the Hive table is shared by multiple consumers, e.g. MapReduce jobs, Pig, etc., and a managed table when the dataset is used only by Hive.
Actually, in Hive a managed table has full control over its dataset: if you drop a managed table, the dataset is also deleted from the Hive warehouse (/user/hive/warehouse) in HDFS, but in the case of an external table, dropping the table does not delete the dataset from HDFS.
For example, suppose you have a 50 GB dataset. Creating multiple copies of it for different purposes simply takes more space, so the better option is an external table: when you drop the table, the dataset is not deleted and can still be used by other applications such as Pig.
As a rule of thumb: use an external table if you plan to work with the data not only from Hive but from other frameworks as well; otherwise make it internal.
The only difference between external and managed tables in Hive is the DROP TABLE / DROP PARTITION behavior: for a managed table the data is dropped as well, while for an external table the data remains untouched in the table/partition location.
Use external tables in most cases. An external table allows you to change the table definition easily, and you can also create several tables on top of the same location.
Use a managed table if the table is temporary/intermediate and its data should be deleted to free space.
A managed table can be converted to external (and vice versa) using:
alter table table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
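For completeness, a hedged sketch of the other direction plus a typical external-table definition (the table, columns, and HDFS path are made up):
-- External table over data that already lives in HDFS; dropping it later
-- leaves the files under /data/logs untouched.
CREATE EXTERNAL TABLE web_logs (
    ip STRING,
    url STRING,
    hit_ts TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';

-- Convert back to managed when Hive should own (and delete) the data:
ALTER TABLE web_logs SET TBLPROPERTIES('EXTERNAL'='FALSE');

-- DESCRIBE FORMATTED web_logs;  -- Table Type shows MANAGED_TABLE or EXTERNAL_TABLE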

oracle synchronize 2 tables

I have the following scenario and need to solve it in ORACLE:
Table A is on a DB-server
Table B is on a different server
Table A will be populated with data.
Whenever something is inserted into Table A, I want to copy it to Table B.
Table B has nearly the same columns, but sometimes I just want to take the content of 2 columns from Table A, concatenate it, and save it to Table B.
I am not very familiar with Oracle, but after researching on Google, some say that you can do it with TRIGGERS or VIEWS. How would you do it?
So in general, there is a table which will be populated, and its content should be copied to a different table.
This is the solution I came up with so far:
create public database link
other_db
connect to
user
identified by
pw
using 'tns-entry';
CREATE TRIGGER modify_remote_my_table
AFTER INSERT ON my_table
BEGIN INSERT INTO ....?
END;
/
How can I select the latest row that was inserted?
If the databases of these two tables are on two different servers, then you will need a database link (db-link) created in the Table A schema so that it can access (read/write) the Table B data through that db-link.
Step 1: Create a database link in the Table A server DB pointing to the Table B server DB.
Step 2: Create a trigger on Table A which inserts the data into Table B over the database link. You can customize (e.g. concatenate the values) inside the trigger before inserting into Table B.
This link should help you
http://searchoracle.techtarget.com/tip/How-to-create-a-database-link-in-Oracle
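Putting those two steps together, a rough sketch of what the trigger could look like (table_a, table_b, and the column names are placeholders; other_db is the database link created above):
CREATE OR REPLACE TRIGGER copy_to_table_b
AFTER INSERT ON table_a
FOR EACH ROW
BEGIN
  -- :NEW holds the row currently being inserted, so there is no need
  -- to "select the latest row" afterwards.
  INSERT INTO table_b@other_db (id, combined_value)
  VALUES (:NEW.id, :NEW.col1 || ' ' || :NEW.col2);
END;
/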
Yes you can do this with triggers. But there may be a few disadvantages.
What if database B is not available? -> Exception handling in your trigger.
What if database B was not available for 2h? You inserted data into database A which is now missing in database B. -> Do crazy things with temporarily inserting it into a cache table in database A.
Performance. Well, the performance for inserting a lot of data will be ugly. Each time you insert data, Oracle will start the PL/SQL engine to insert the data into the remote database.
Maybe you could think about using MViews (Materialized Views) to replicate the data via database link. Later you can build your queries so that they access tables from database B and add the required data from database A by joining the MViews.
You can also use fast refresh to replicate the data in (almost) real time.
From perspective of an Oracle Database Admin this would make a lot more sense than the trigger approach.
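A rough sketch of the materialized-view route (all names are placeholders; fast refresh over a database link also needs a materialized view log on the source table, and ON COMMIT refresh is not available for remote masters):
-- On database A (the source):
CREATE MATERIALIZED VIEW LOG ON table_a WITH PRIMARY KEY;

-- On database B, pulling the rows over the database link:
CREATE MATERIALIZED VIEW table_a_mv
  REFRESH FAST ON DEMAND
  AS SELECT id, col1, col2 FROM table_a@link_to_a;

-- Refresh periodically, e.g. from a scheduler job:
-- EXEC DBMS_MVIEW.REFRESH('TABLE_A_MV', 'F');
The concatenation of the two columns can then be done in a view or query on top of the materialized view.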
Try this code: database links are considered rather insecure, Oracle's own replication options tend to carry licence costs these days, and some of the other options are deprecated as well.
https://gist.github.com/anonymous/e3051239ba401e416565cdd912e0de8c
It uses ora_rowscn to sync tables across two different Oracle databases.

How to delete and update a record in Hive

I have installed Hadoop, Hive, and the Hive JDBC driver, which are running fine for me. But I still have a problem: how do I delete or update a single record in Hive? The DELETE and UPDATE commands I know from MySQL do not work in Hive.
Thanks
hive> delete from student where id=1;
Usage: delete [FILE|JAR|ARCHIVE] <value> [<value>]*
Query returned non-zero code: 1, cause: null
As of Hive version 0.14.0: INSERT...VALUES, UPDATE, and DELETE are now available with full ACID support.
INSERT ... VALUES Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
Where values_row is:
( value [, value ...] )
where a value is either null or any valid SQL literal
UPDATE Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
DELETE Syntax:
DELETE FROM tablename [WHERE expression]
Additionally, from the Hive Transactions doc:
If a table is to be used in ACID writes (insert, update, delete) then the table property "transactional" must be set on that table, starting with Hive 0.14.0. Without this value, inserts will be done in the old style; updates and deletes will be prohibited.
Hive DML reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
Hive Transactions reference:
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
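A minimal end-to-end sketch of that syntax (Hive 0.14+; the table and values are made up, and the table must be ORC, bucketed, and flagged transactional):
CREATE TABLE people (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO TABLE people VALUES (1, 'alice'), (2, 'bob');
UPDATE people SET name = 'carol' WHERE id = 2;
DELETE FROM people WHERE id = 1;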
You should not think of Hive as a regular RDBMS; Hive is better suited for batch processing over very large sets of immutable data.
The following applies to versions prior to Hive 0.14, see the answer by ashtonium for later versions.
There is no operation supported for deletion or update of a particular record or particular set of records, and to me this is more a sign of a poor schema.
Here is what you can find in the official documentation:
Hadoop is a batch processing system and Hadoop jobs tend to have high latency and
incur substantial overheads in job submission and scheduling. As a result -
latency for Hive queries is generally very high (minutes) even when data sets
involved are very small (say a few hundred megabytes). As a result it cannot be
compared with systems such as Oracle where analyses are conducted on a
significantly smaller amount of data but the analyses proceed much more
iteratively with the response times between iterations being less than a few
minutes. Hive aims to provide acceptable (but not optimal) latency for
interactive data browsing, queries over small data sets or test queries.
Hive is not designed for online transaction processing and does not offer
real-time queries and row level updates. It is best used for batch jobs over
large sets of immutable data (like web logs).
A way to work around this limitation is to use partitions: I don't know what your id corresponds to, but if you're getting different batches of ids separately, you could redesign your table so that it is partitioned by id, and then you would be able to easily drop partitions for the ids you want to get rid of, as sketched below.
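For example, if the table were partitioned on a batch_id column (hypothetical table and column names), removing a batch becomes a cheap metadata operation:
CREATE TABLE events (payload STRING)
PARTITIONED BY (batch_id INT)
STORED AS ORC;

-- "Deleting" batch 42 is just dropping its partition:
ALTER TABLE events DROP IF EXISTS PARTITION (batch_id = 42);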
Yes, rightly said: Hive does not support an UPDATE option.
But the following alternative can be used to achieve the same result:
Update records in a partitioned Hive table:
The main table is assumed to be partitioned by some key.
Load the incremental data (the data to be updated) to a staging table partitioned with the same keys as the main table.
Join the two tables (main & staging tables) using a LEFT OUTER JOIN operation as below:
insert overwrite table main_table partition (c,d)
select t2.a, t2.b, t2.c,t2.d from staging_table t2 left outer join main_table t1 on t1.a=t2.a;
In the above example, the main_table and the staging_table are partitioned using the (c,d) keys. The tables are joined via a LEFT OUTER JOIN and the result is used to OVERWRITE the partitions in the main_table. Note that this assumes the staging table carries the complete data for every partition it touches, because INSERT OVERWRITE replaces those partitions entirely.
A similar approach could be used for UPDATE operations on un-partitioned Hive tables too, as sketched below.
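For an un-partitioned table, the same idea can be expressed as a full rewrite of the table: keep the main rows that have no staged replacement and take everything from staging. A sketch assuming main_table(a, b) and staging_table(a, b) with a as the key:
INSERT OVERWRITE TABLE main_table
SELECT u.a, u.b
FROM (
    -- rows that were not touched by the incremental load
    SELECT t1.a, t1.b
    FROM main_table t1
    LEFT OUTER JOIN staging_table t2 ON t1.a = t2.a
    WHERE t2.a IS NULL
    UNION ALL
    -- new and updated rows
    SELECT a, b FROM staging_table
) u;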
You can delete rows from a table using a workaround, in which you overwrite the table by the dataset you want left into the table as a result of your operation.
insert overwrite table your_table
select * from your_table
where id <> 1
;
The workaround is useful mostly for bulk deletions of easily identifiable rows. Also, obviously, doing this can muck up your data, so a backup of the table is advised, as is care when planning the "deletion" rule.
Once you have installed and configured Hive, create a simple table:
hive>create table testTable(id int,name string)row format delimited fields terminated by ',';
Then, try to insert a few rows into the test table.
hive>insert into table testTable values (1,'row1'),(2,'row2');
Now try to delete the records you just inserted into the table.
hive>delete from testTable where id = 1;
Error!
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
By default, transactions are configured to be off. The error says that the configured transaction manager does not support update and delete operations. To support update/delete, you must change the following configuration.
cd $HIVE_HOME
vi conf/hive-site.xml
Add below properties to file
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.enforce.bucketing</name>
<value>true</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
<property>
<name>hive.txn.manager</name>
<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>
<property>
<name>hive.compactor.worker.threads</name>
<value>2</value>
</property>
Restart the service and then try the delete command again:
Error!
FAILED: LockException [Error 10280]: Error communicating with the metastore.
There is a problem with the metastore. In order to use insert/update/delete operations, you need to change the following configuration in conf/hive-site.xml, as the feature is still in development.
<property>
<name>hive.in.test</name>
<value>true</value>
</property>
Restart the service and then try the delete command again:
hive>delete from testTable where id = 1;
Error!
FAILED: SemanticException [Error 10297]: Attempt to do update or delete on table default.testTable that does not use an AcidOutputFormat or is not bucketed.
Only ORC file format is supported in this first release. The feature has been built such that transactions can be used by any storage format that can determine how updates or deletes apply to base records (basically, that has an explicit or implicit row id), but so far the integration work has only been done for ORC.
Tables must be bucketed to make use of these features. Tables in the same system not using transactions and ACID do not need to be bucketed.
See the table example below, built with the ORC file format, bucketing enabled, and ('transactional'='true').
hive>create table testTableNew(id int ,name string ) clustered by (id) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');
Insert :
hive>insert into table testTableNew values (1,'row1'),(2,'row2'),(3,'row3');
Update :
hive>update testTableNew set name = 'updateRow2' where id = 2;
Delete :
hive>delete from testTableNew where id = 1;
Test :
hive>select * from testTableNew ;
Configuration Values to Set for INSERT, UPDATE, DELETE
In addition to the new parameters listed above, some existing parameters need to be set to support INSERT ... VALUES, UPDATE, and DELETE.
Configuration key - must be set to:
hive.support.concurrency - true (default is false)
hive.enforce.bucketing - true (default is false) (not required as of Hive 2.0)
hive.exec.dynamic.partition.mode - nonstrict (default is strict)
Configuration Values to Set for Compaction
If the data in your system is not owned by the Hive user (i.e., the user that the Hive metastore runs as), then Hive will need permission to run as the user who owns the data in order to perform compactions. If you have already set up HiveServer2 to impersonate users, then the only additional work to do is assure that Hive has the right to impersonate users from the host running the Hive metastore. This is done by adding the hostname to hadoop.proxyuser.hive.hosts in Hadoop's core-site.xml file. If you have not already done this, then you will need to configure Hive to act as a proxy user. This requires you to set up keytabs for the user running the Hive metastore and add hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hive.groups to Hadoop's core-site.xml file. See the Hadoop documentation on secure mode for your version of Hadoop (e.g., for Hadoop 2.5.1 it is at Hadoop in Secure Mode).
The UPDATE statement has the following limitations:
The expression in the WHERE clause must be an expression supported by a Hive SELECT clause.
Partition and bucket columns cannot be updated.
Query vectorization is automatically disabled for UPDATE statements. However, updated tables can still be queried using vectorization.
Subqueries are not allowed on the right side of the SET statement.
The following example demonstrates the correct usage of this statement:
UPDATE students SET name = null WHERE gpa <= 1.0;
DELETE Statement
Use the DELETE statement to delete data already written to Apache Hive.
DELETE FROM tablename [WHERE expression];
The DELETE statement has the following limitation:
query vectorization is automatically disabled for the DELETE operation.
However, tables with deleted data can still be queried using vectorization.
The following example demonstrates the correct usage of this statement:
DELETE FROM students WHERE gpa <= 1.0;
The CLI told you where your mistake is: delete WHAT? from student ...
Delete : How to delete/truncate tables from Hadoop-Hive?
Update : Update , SET option in Hive
If you want to delete all records, then as a workaround you can load an empty file into the table in OVERWRITE mode:
hive> LOAD DATA LOCAL INPATH '/root/hadoop/textfiles/empty.txt' OVERWRITE INTO TABLE employee;
Loading data to table default.employee
Table default.employee stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
OK
Time taken: 0.19 seconds
hive> SELECT * FROM employee;
OK
Time taken: 0.052 seconds
An upcoming version of Hive is going to allow SET-based update/delete handling, which is of utmost importance when trying to do CRUD operations on a 'bunch' of rows instead of taking one row at a time.
In the interim, I have tried a dynamic-partition-based approach documented here: http://linkd.in/1Fq3wdb
Please see if it suits your need.
UPDATE or DELETE of a record isn't allowed in Hive, but INSERT INTO is acceptable.
A snippet from Hadoop: The Definitive Guide(3rd edition):
Updates, transactions, and indexes are mainstays of traditional databases. Yet, until recently, these features have not been considered a part of Hive's feature set. This is because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table. For a data warehousing application that runs over large portions of the dataset, this works well.
Hive doesn't support updates (or deletes), but it does support INSERT INTO, so it is possible to add new rows to an existing table.
To achieve your current need, you can fire the query below:
> insert overwrite table student
> select *from student
> where id <> 1;
This will rewrite the table under the same name with all rows except the ones you want to exclude/delete.
I tried this on Hive 1.2.1
There are a few properties to set to make a Hive table support ACID and thereby UPDATE, INSERT, and DELETE as in SQL.
Conditions to create an ACID table in Hive:
1. The table should be stored as an ORC file. Only the ORC format can support ACID properties for now.
2. The table must be bucketed.
Properties to set to create an ACID table:
set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=1;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
Also set the property hive.in.test to true in hive-site.xml.
After setting all these properties, the table should be created with the table property 'transactional'='true'. The table should be bucketed and stored as ORC:
CREATE TABLE table_name (col1 int, col2 string, col3 int)
CLUSTERED BY (col1) INTO 4 BUCKETS
STORED AS orc TBLPROPERTIES ('transactional'='true');
Now the Hive table can support UPDATE and DELETE queries
Delete was recently added, in Hive version 0.14.
Deletes can only be performed on tables that support ACID.
Below is the link from Apache:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Delete
Good news: inserts, updates, and deletes are now possible on Hive/Impala using Kudu.
You need to use Impala/Kudu to maintain the tables and perform insert/update/delete on records.
Details with examples can be found here:
insert-update-delete-on-hadoop
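For illustration, a hedged sketch of what that flow looks like from Impala with a Kudu-backed table (the table, columns, and values are made up; requires an Impala build with Kudu integration):
CREATE TABLE users (
  id   BIGINT,
  name STRING,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;

INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
UPDATE users SET name = 'carol' WHERE id = 2;
DELETE FROM users WHERE id = 1;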
Please share the news if you are excited.
-MIK
Recently I was looking to resolve a similar issue: Apache Hive and Hadoop do not support update/delete operations out of the box. So what can you do?
You have two ways:
Use a backup table: save the whole table in a backup_table, truncate your input table, then re-write only the data you are interested in keeping.
Use Uber's Hudi: it's a framework created by Uber to work around these HDFS limitations, including deletion and update. You can take a look at this link:
https://eng.uber.com/hoodie/
An example for point 1:
Create table bck_table like input_table;
Insert overwrite table bck_table
select * from input_table;
Truncate table input_table;
Insert overwrite table input_table
select * from bck_table where id <> 1;
NB: if input_table is an external table, you must follow this link:
How to truncate a partitioned external table in hive?
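(One common workaround for that case, sketched here with the same table name, is to temporarily flip the table to managed, truncate it, and flip it back; only do this if you really want the underlying files removed:)
ALTER TABLE input_table SET TBLPROPERTIES('EXTERNAL'='FALSE');
TRUNCATE TABLE input_table;
ALTER TABLE input_table SET TBLPROPERTIES('EXTERNAL'='TRUE');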
If you want to perform Hive CRUD using ACID operations, you need to check whether you have Hive version 0.14 or later.
In order to perform CREATE, SELECT, UPDATE, and DELETE, we have to create the table with the following conditions:
The file format should be ORC, with TBLPROPERTIES('transactional'='true').
The table should be CLUSTERED BY some column(s) into buckets; please refer to the CREATE TABLE statement below.
You can use the query below to create a table with the above properties:
CREATE TABLE STUDENT
(
STD_ID INT,
STD_NAME STRING,
AGE INT,
ADDRESS STRING
)
CLUSTERED BY (ADDRESS) into 3 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED as orc tblproperties('transactional'='true');
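With that table in place, the usual DML works; for illustration (made-up values, and note that the bucketing column ADDRESS itself cannot be updated):
INSERT INTO TABLE STUDENT VALUES (1, 'John', 21, 'Chennai');
UPDATE STUDENT SET AGE = 22 WHERE STD_ID = 1;
DELETE FROM STUDENT WHERE STD_ID = 1;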

How do I Drop Table from slony

I have a database which is being backed up by Slony. I dropped a table from the replicated DB and re-created the same table using SQL scripts, with nothing done through Slony scripts.
I found this on a post and tried it:
Recreate the table
Get the OID for the recreated table: SELECT oid FROM pg_class WHERE relname = '<your_table>' AND relkind = 'r';
Update the tab_reloid in sl_table for the problem table.
Execute SET DROP TABLE ( ORIGIN = N, ID = ZZZ); where N is the NODE # for the MASTER, and ZZZ is the ID # in sl_table.
But it doesn't seem to work.
How do I drop the table from the replicated DB? Or is there a way to use the newly created table in place of the old one?
The authoritative documentation on dropping things from Slony is here.
It's not really clear what state things were in before you ran the commands above, and you haven't clarified "doesn't seem to work".
There is one significant "gotcha" that I know of with dropping tables from replication with Slony. After you remove a table from replication, you can have trouble actually physically dropping the table on the slaves (but not on the master) with Slony 1.2, getting a cryptic error like this:
ERROR: "table_pkey" is an index
This may be fixed in Slony 2.0, but the problem here is that there is a referential integrity relationship between the unreplicated table on the slave and the replicated table, and Slony 1.2 has intentionally "corrupted" the system catalogs somewhat as part of its design, causing this issue.
A solution is to run the "DROP TABLE" command through slonik_execute_script. If you have already physically dropped the table on the master, you can use the option "EXECUTE ONLY ON" to run the command only on a specific slave. See the docs for EXECUTE SCRIPT for details.
You have dropped the table from the database, but you haven't dropped it from _YOURCLUSTERNAME.sl_table.
The "_" before YOURCLUSTERNAME is important.
4 steps to solve the mess:
1. Get the tab_id:
select tab_id from _YOURCLUSTERNAME.sl_table where tab_relname='MYTABLENAME' and tab_nspname='MYSCHEMANAME';
In MYDATABASE this returns a number, e.g. 2.
2. Delete the triggers:
select _YOURCLUSTERNAME.altertablerestore(2);
This can return an error, because it tries to delete the triggers on the original table, and there is now a new one.
3. Delete the Slony index, if one was created:
select _YOURCLUSTERNAME.tableDropKey(2);
This can also return an error, because it tries to delete an index on the original table, and there is now a new table.
4. Delete the table from sl_table:
delete from _YOURCLUSTERNAME.sl_table where tab_id = 2;
The best way to drop a table is:
1. Drop the table from the cluster:
select tab_id from _YOURCLUSTERNAME.sl_table where tab_relname='MYTABLENAME' and tab_nspname='MYSCHEMANAME';
In MYDATABASE this returns a number, e.g. 2.
Execute with slonik < myfile.slonik,
where myfile.slonik is:
cluster name=MYCLUSTER;
NODE 1 ADMIN CONNINFO = 'dbname=DATABASENAME host=HOST1_MASTER user=postgres port=5432';
NODE 2 ADMIN CONNINFO = 'dbname=DATABASENAME host=HOST2_SLAVE user=postgres port=5432';
SET DROP TABLE (id = 2, origin = 1);
Here 2 is the tab_id from sl_table and 1 is NODE 1, the HOST1_MASTER.
2. Drop the table from the slave with a plain SQL DROP TABLE.
