How to create an AWS Athena Table with Partition Projection and Bucketing enabled?

I am trying to create an Athena Table that makes use of both Projected Partitioning and Bucketing (CLUSTERED BY). I'm doing this to get a side-by-side performance comparison for our dataset with and without Bucketing. From my tests, this combination does not seem to be supported, but I cannot find anything in the documentation that explicitly states this, so I'm assuming that I'm missing something. Bucketing works with normal Partitioning, but I'm trying to make use of Projected Partitioning so that I do not have to maintain the Partitions in the Glue Catalog.
This is my setup. I have an existing Athena Table that is set up to read gzipped Parquet files on S3. This all works. In order to create the bucketed version of my Table(s), I'm using Athena CTAS to create bucketed, gzipped Parquet files. The CTAS files are written to a staging location and then I move them to a location that fits my storage structure. I then try to create a new Table that points to the bucketed data, with both Projected Partitioning and Bucketing enabled in the Table setup. I've used both Athena SQL and AWS Wrangler's create_parquet_table to do this.
Here is the original CTAS SQL that creates the Bucketed Files:
CREATE TABLE database_name.ctas_table_name
WITH (
  external_location = 's3://bucket/staging_prefix',
  partitioned_by = ARRAY['partition_column'],
  bucketed_by = ARRAY['index_column'],
  bucket_count = 10,
  format = 'PARQUET'
)
AS
SELECT
  index_column,
  partition_column
FROM database_name.table_name;
The files produced from the above CTAS are then moved from the staging location to the actual location, let's call it 's3://bucket/table_prefix'. This results in a s3 structure like:
s3://bucket/table_prefix/partition_column=xx/file.bucket_000.gzip.parquet
s3://bucket/table_prefix/partition_column=xx/file.bucket_001.gzip.parquet
...
s3://bucket/table_prefix/partition_column=xx/file.bucket_009.gzip.parquet
So 10 bucketed files per partition.
Then the SQL to create the Table on Athena:
CREATE EXTERNAL TABLE database_name.table_name (
  index_column bigint
)
PARTITIONED BY (partition_column bigint)
CLUSTERED BY (index_column) INTO 10 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket/table_prefix'
TBLPROPERTIES (
  'classification'='parquet',
  'compressionType'='gzip',
  'projection.enabled'='true',
  'projection.partition_column.type'='integer',
  'projection.partition_column.range'='0,3650',
  'typeOfData'='file'
);
If I submit this last CREATE TABLE SQL, it succeeds. However, when selecting from the table, I get the following error message:
HIVE_INVALID_BUCKET_FILES: Hive table 'database_name.table_name' is corrupt. Found sub-directory in bucket directory for partition: <UNPARTITIONED>
If I try to create the Table using the aforementioned awswrangler.catalog.create_parquet_table, which looks like this:
response = awswrangler.catalog.create_parquet_table(
    boto3_session = boto3_session,
    database = 'database_name',
    table = 'table_name',
    path = 's3://bucket/table_prefix',
    partitions_types = {"partition_column": "bigint"},
    columns_types = {"index_column": "bigint", "partition_column": "bigint"},
    bucketing_info = (["index_column"], 10),
    compression = 'gzip',
    projection_enabled = True,
    projection_types = {"partition_column": "integer"},
    projection_ranges = {"partition_column": "0,3650"},
)
This API call raises the following exception:
awswrangler.exceptions.InvalidArgumentCombination: Column index_column appears as projected column but not as partitioned column.
This does not make sense, as partition_column clearly is declared as a partition column. I believe this to be a red herring in any case. If I remove the bucketing_info parameter, it all works. Conversely, if I remove the projection_* parameters, it all works.
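For reference, this is roughly the non-projected variant that does work for me: bucketing combined with conventional partitions registered in the Glue Catalog (the _bucketed table name here is just for illustration):

CREATE EXTERNAL TABLE database_name.table_name_bucketed (
  index_column bigint
)
PARTITIONED BY (partition_column bigint)
CLUSTERED BY (index_column) INTO 10 BUCKETS
STORED AS PARQUET
LOCATION 's3://bucket/table_prefix'
TBLPROPERTIES ('classification'='parquet', 'compressionType'='gzip');

-- the partitions then have to be maintained in the Glue Catalog, e.g.
MSCK REPAIR TABLE database_name.table_name_bucketed;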
So from what I can gather, Partition Projection is not compatible with Bucketing. But this is not made clear in the documentation, nor could I find anything online to confirm it. So I'm asking here: does anyone know what is going on?
Did I make a mistake in my setup?
Did I miss a piece of AWS Athena documentation that states this is not possible?
Or is this an undocumented incompatibility?
Or... aliens??

Related

How to import data from parquet file to existing Hadoop table?

I have created some tables in my Hadoop cluster, and I have some Parquet files with data that I want to load into them. How do I do this? I want to stress that I already have empty tables, created with some DDL commands, and they are also stored as Parquet, so I don't need to create tables, only to import data.
You can take advantage of a feature that lets you derive a table definition directly from a Parquet file, even though you don't want to create a new permanent table. I assume the Parquet files' schema is the same as that of the existing empty table; if it isn't, the statements below won't work as-is and you will have to select only the columns you need.
Here, the empty table you already have is called emptyTable and lives in myDatabase. The new data you want to add is located at /path/to/parquet/hdfs_path_of_parquet_file.
-- create a temporary table whose schema is derived from the Parquet file
-- (EXTERNAL so that dropping it later does not delete the source files)
CREATE EXTERNAL TABLE myDatabase.my_temp_table
LIKE PARQUET '/path/to/parquet/hdfs_path_of_parquet_file'
STORED AS PARQUET
LOCATION '/path/to/parquet/';

-- copy the rows into the existing, empty table
INSERT INTO myDatabase.emptyTable
SELECT * FROM myDatabase.my_temp_table;

-- clean up the temporary table
DROP TABLE myDatabase.my_temp_table;
You said you didn't want to create tables but I think the above kinda cheats around your ask.
The other option, again assuming the Parquet schema already matches the definition of the empty table you already have:
ALTER TABLE myDatabase.emptyTable SET LOCATION '/path/to/parquet/';
This technically isn't creating a new table, but it does require altering the table you already created, so I'm not sure if that's acceptable.
You said this is a Hive thing, so I've given you Hive answers. But really, if emptyTable's table definition understands Parquet in the exact format of /path/to/parquet/hdfs_path_of_parquet_file, you could just drop the file into the folder defined by the table definition:
show create table myDatabase.emptyTable;
This would automatically add the data to the existing table, provided the table definition matches. Hive is schema-on-read, so you don't actually need to "import" the data, only enable Hive to "interpret" it.
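If you would rather let Hive move the file for you, something along these lines should be equivalent (this assumes the file already sits in HDFS and matches the table's storage format):

-- moves (not copies) the file from its current HDFS path into emptyTable's location
LOAD DATA INPATH '/path/to/parquet/hdfs_path_of_parquet_file' INTO TABLE myDatabase.emptyTable;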

How to use external table in hive?

Can anyone please explain why and where we use external tables in Hive?
Please explain a scenario so it is easy to understand.
We use an external table when the underlying dataset pointed to by the Hive table is shared by many consumers, e.g. MapReduce jobs, Pig, etc., and a managed table when the dataset is used only by Hive.
Actually, in Hive a managed table has full control over its dataset: if you drop a managed table, the dataset is also deleted from the Hive warehouse (/user/hive/warehouse) in HDFS. In the case of an external table, when you drop the table the dataset is not deleted.
For example, suppose you have a 50 GB dataset. If you create multiple copies of it for different purposes, it simply takes more space, so the better option is an external table: when you drop the table the dataset is not deleted, and it can still be used by any other application, such as Pig, or for any other purpose.
As a rule of thumb: use an external table if you plan to work with the data not only from Hive but from other frameworks as well. Otherwise make it managed (internal).
The only difference between an external and a managed table in Hive is the DROP TABLE / DROP PARTITION behavior. For a managed table the data is dropped as well; for an external table the data remains untouched in the table/partition location.
Use external in most cases. An external table allows you to change the table definition easily, and you can also create several tables on top of the same location.
Use a managed table if the table is temporary/intermediate and its data should be deleted to free space.
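For illustration, a minimal sketch of the two flavours (the table and path names here are made up):

-- external: dropping the table leaves the files under /data/events untouched
CREATE EXTERNAL TABLE events_ext (id bigint, payload string)
STORED AS PARQUET
LOCATION '/data/events';

-- managed: dropping the table also deletes its data under the warehouse directory
CREATE TABLE events_tmp (id bigint, payload string)
STORED AS PARQUET;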
A managed table can be converted to external, and vice-versa, using:
alter table table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
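And, assuming your Hive version accepts the same property in the other direction, converting back looks like this:

-- convert back to a managed table (its data will then be dropped together with the table)
alter table table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');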

Unable to partition hive table backed by HDFS

Maybe this is an easy question, but I am having a difficult time resolving the issue. At this time, I have a pseudo-distributed HDFS that contains recordings encoded using protobuf 3.0.0. Then, using Elephant-Bird/Hive, I am able to put that data into Hive tables to query. The problem I am having is partitioning the data.
This is the table create statement that I am using:
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive tables and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition by date is partly performance but, more so, because the "LOAD DATA ..." statements move the files between directories in HDFS.
P.S. I have proven that I am able to run queries against the Hive table without partitioning.
Any thoughts ?
I see that you have created an EXTERNAL TABLE, so you cannot add or drop partitions using Hive alone; you need to create the folder using HDFS, MR or Spark. An EXTERNAL table can be read by Hive, but its data is not managed by Hive. You can check the HDFS location '/test/dt=20171117' and you will see that the folder has not been created.
My suggestion is to create the folder (partition) using "hadoop fs -mkdir '/test/20171117'" and then try to query the table. Although it will return 0 rows, you can then add data to that folder and read it from Hive.
You need to specify a LOCATION for an EXTERNAL TABLE:
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the Elephant-Bird library works, but you'll want to double-check that.
Then, your table locations need to look like /test/dt=value in order for Hive to read them.
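For example, if the files for that day are moved under the conventional layout, the partition can be registered like this (reusing the names from your question):

-- assumes the 2017-11-17 files now live under /test/dt=20171117
ALTER TABLE test_messages ADD PARTITION (dt = '20171117') LOCATION '/test/dt=20171117';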
After you create an external table over an HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore.

Hive - Hbase integration Transactional update with timestamp

I am new to Hadoop and big data, and I am trying to figure out the possibilities of moving my data store to HBase. I have come across a problem which some of you might be able to help me with. It's like this:
I have an HBase table "hbase_testTable" with column family "ColFam1". I have set the number of versions of "ColFam1" to 10, as I have to maintain a history of up to 10 updates to this column family, which works fine. When I add new rows through the HBase shell with an explicit timestamp value, it also works fine. Basically, I want to use the timestamp as my version control, so I specify the timestamp as
put 'hbase_testTable', '1001', 'ColFam1:q1', '1000$', 3
where '3' is my version, and everything works fine.
Now I am trying to integrate with a Hive external table, and I have all the mappings set up to match the HBase table, like below:
CREATE EXTERNAL TABLE testtable (id string, q1 string, q2 string, q3 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,colfam1:q1,colfam1:q2,colfam1:q3")
TBLPROPERTIES ("hbase.table.name" = "testtable", "transactional" = "true");
And it works fine with normal insertion: it updates the HBase table and vice-versa.
Even though the external table is marked "transactional", I am not able to update the data from Hive. It gives me this error:
FAILED: SemanticException [Error 10294]: Attempt to do update or delete
using transaction manager that does not support these operations
That said, any updates made to the HBase table are reflected immediately in the Hive table.
I can update the HBase table through the Hive external table by inserting a row into the Hive external table with the same "rowid" and new data for the column.
Is it possible for me to control the timestamp being written to the referenced HBase table (like 4, 5, 6, 7, etc.)? Please help.
The timestamp is one of the important elements of HBase versioning. You are trying to create your own timestamps, which works fine at the HBase level.
One point: you should be very careful to keep them unique and non-negative. You can look at the custom versioning section in the HBase: The Definitive Guide book.
Now you have Hive on top of HBase. As per the documentation:
there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp.
That's for the reading part. For putting data, you can look here.
It still says that you have to give a valid timestamp and not any other value.
The future versions are expected to expose the timestamp attribute.
I hope this gives you a better idea of how to deal with custom timestamps in Hive-HBase integration.

Hadoop & Hive as warehouse: daily data deliveries

I am evaluating the combination of Hadoop and Hive (and Impala) as a replacement for a large data warehouse. I have already set up a version, and read performance is great.
Can somebody give me a hint on what concept should be used for daily data deliveries to a table?
I have a table in Hive based on a file I put into HDFS. But now new transactional data comes in on a daily basis.
How do I add it to the table in Hive?
Inserts are not possible. HDFS cannot append. So what's the general concept I need to follow?
Any advice or direction to documentation is appreciated.
Best regards!
Hive allows for data to be appended to a table - the underlying implementation of how this happens in HDFS doesn't matter. There are a number of things you can do to append data:
INSERT - You can just append rows to an existing table.
INSERT OVERWRITE - If you have to process data, you can perform an INSERT OVERWRITE to re-write a table or partition.
LOAD DATA - You can use this to bulk insert data into a table and, optionally, use the OVERWRITE keyword to wipe out any existing data.
Partition your data.
Load data into a new table and swap the partition in
Partitioning is great if you know you're going to be performing date-based searches, and it gives you the ability to use options 1, 2, and 3 at either the table or partition level.
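Roughly, the three statements look like this (the table, column and path names are only placeholders):

-- 1. append rows to a partition (staging_sales is assumed to hold only the non-partition columns)
INSERT INTO TABLE sales PARTITION (ds = '2016-01-01')
SELECT * FROM staging_sales;

-- 2. re-write a partition after (re)processing
INSERT OVERWRITE TABLE sales PARTITION (ds = '2016-01-01')
SELECT * FROM staging_sales;

-- 3. bulk-load files; add OVERWRITE before INTO to wipe out the existing data first
LOAD DATA INPATH '/incoming/2016-01-01' INTO TABLE sales PARTITION (ds = '2016-01-01');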
Inserts are not possible
Inserts are possible; for example, you can create a new table and insert the data from the new table into the old table.
But the simpler solution is to load the data from the file into the Hive table with the command below.
load data inpath '/filepath' [overwrite] into table tablename;
If you use overwrite, the existing data is replaced with the new data; otherwise the new data is simply appended.
You can even schedule the load by wrapping it in a shell script.
