Detecting CSV Headers when creating a DataBricks Delta Table? - azure-databricks

Needless to say, I'm new to Spark, Databricks and Delta.
I'm trying to create a Delta table using %sql from a simple CSV where the first row is a header row. Unfortunately I can't seem to get the initial CREATE TABLE to recognise the header row in the CSV (just to note, I've been using the Databricks quickstart as a guide - https://docs.databricks.com/delta/quick-start.html).
The code I've got in my Databricks notebook is
%sql
CREATE TABLE people
USING delta
LOCATION '/dbfs/mnt/mntdata/DimTransform/People.csv'
I have tried using the TBLPROPERTIES ("headers" = "true") but with no success - see below
%sql
CREATE TABLE people
USING delta
TBLPROPERTIES ("headers" = "true")
AS SELECT *
FROM csv.`/mnt/mntdata/DimTransform/People.csv`
In both cases, the csv data is loaded into the table but the header row is just included in the data as the first standard row.
Any ideas how I'd get this %sql CREATE TABLE to recognise the first/header row as a header when loading from a csv?
Thanks

You may need a small workaround because you are using a CSV file rather than JSON or Parquet; those formats carry their own schema, while CSV does not.
So I suggest doing this:
%sql
drop table if exists tempPeopleTable ;
CREATE TABLE tempPeopleTable
USING csv
OPTIONS (path "/mnt/mntdata/DimTransform/People.csv", header "true", inferSchema "true");
CREATE TABLE people
USING delta
AS SELECT * FROM tempPeopleTable;
drop table if exists tempPeopleTable;
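If you'd rather not create and drop a physical staging table, a temporary view should work the same way. This is just a sketch using the path from the question; the view name people_csv is made up:
%sql
CREATE TEMPORARY VIEW people_csv
USING csv
OPTIONS (path "/mnt/mntdata/DimTransform/People.csv", header "true", inferSchema "true");

CREATE TABLE people
USING delta
AS SELECT * FROM people_csv;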

Related

How to create an AWS Athena Table with Partition Projection and Bucketing enabled?

I am trying to create an Athena Table that makes use of both Projected Partitioning and Bucketing (CLUSTERED BY). I'm doing this to get a side-by-side performance comparison for our dataset with and without Bucketing. Through my tests, this does not seem to be supported, but I cannot find anything in the documentation that explicitly states this, so I'm assuming that I'm missing something. Bucketing works with normal Partitioning, but I'm trying to make use of Projected Partitioning so that I do not have to maintain the Partitions in the Glue Catalog.
This my setup. I have an existing Athena Table that is setup to read Gzipped Parquet files on S3. This all works. In order to create the Bucketed version of my Table(s), I'm using Athena CTAS to create Bucketed Gzipped Parquet Files. The CTAS files are written to a staging location and then I move them to a location that fits my storage structure. I then try to create a new Table that points to the bucketed data and try to enable Projected Partitioning and Bucketing in the Table setup. I've used both Athena SQL and AWS Wrangler's create_parquet_table to do this.
Here is the original CTAS SQL that creates the Bucketed Files:
CREATE TABLE database_name.ctas_table_name
WITH (
external_location = 's3://bucket/staging_prefix',
partitioned_by = ARRAY['partition_column'],
bucketed_by = ARRAY['index_column'],
bucket_count = 10,
format = 'PARQUET'
)
AS
SELECT
index_column,
partition_column
FROM database_name.table_name;
The files produced from the above CTAS are then moved from the staging location to the actual location, let's call it 's3://bucket/table_prefix'. This results in an S3 structure like:
s3://bucket/table_prefix/partition_column=xx/file.bucket_000.gzip.parquet
s3://bucket/table_prefix/partition_column=xx/file.bucket_001.gzip.parquet
...
s3://bucket/table_prefix/partition_column=xx/file.bucket_009.gzip.parquet
So, 10 bucketed files per partition.
Then the SQL to create the Table on Athena
CREATE TABLE database_name.table_name (
index_column bigint,
partition_column bigint
)
CLUSTERED BY (index_column) INTO 10 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://bucket/table_prefix'
TBLPROPERTIES (
'classification'='parquet',
'compressionType'='gzip',
'projection.enabled'='true',
'projection.index_column.type'='integer',
'projection.index_column.range'='0,3650',
'typeOfData'='file'
);
If I submit this last CREATE TABLE SQL, it succeeds. However, when selecting from the table, I get the following error message:
HIVE_INVALID_BUCKET_FILES: Hive table 'database_name.table_name' is corrupt. Found sub-directory in bucket directory for partition: <UNPARTITIONED>
If I try to create the Table using the aforementioned awswrangler.catalog.create_parquet_table, which looks like this:
response = awswrangler.catalog.create_parquet_table(
boto3_session = boto3_session,
database = 'database_name',
table = 'table_name',
path = 's3://bucket/table_prefix',
partitions_types = {"partition_column": "bigint"},
columns_types = {"index_column": "bigint", "partition_column": "bigint"},
bucketing_info = (["index_column"], 10),
compression = 'gzip',
projection_enabled = True,
projection_types = {"partition_column": "integer"},
projection_ranges = {"partition_column": "0,3650"},
)
This API call raises the following exception:
awswrangler.exceptions.InvalidArgumentCombination: Column index_column appears as projected column but not as partitioned column.
This does not make sense, as it clearly is there. I believe this to be a red herring in any case. If I remove the <bucketing_info> parameter, it all works. Conversely, if I remove the <projection...> parameters, it all works.
So from what I can gather, Partition Projection is not compatible with Bucketing. But this is not made clear in the documentation, nor could I find anything online to support this. So I'm asking here if anyone knows what is going on?
Did I make a mistake in my setup?
Did I miss a piece of AWS Athena documentation that states this is not possible?
Or is this an undocumented incompatibility?
Or... aliens??

Refresh hive tables in Hive

I have a few tables in Hive, and every day a new CSV file is added to each table's location. When new data is available, I need to refresh the tables so that I can see the new data in them.
Steps we follow to load the data:
first, create a table with CSV SerDe properties
then, create another table stored as Parquet for production use
finally, insert the data from the first table into the second table
(A rough HiveQL sketch of these steps appears after the sample data below.)
Initial:
1,a
2,b
3,c
New file:
4,d
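A sketch of that load pattern in HiveQL, assuming the two-column layout from the sample data (all table, column and path names here are made up for illustration):
-- hypothetical staging table over the CSV files (OpenCSVSerde reads everything as strings)
CREATE EXTERNAL TABLE staging_csv (id STRING, val STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/staging_csv';

-- hypothetical production table stored as Parquet
CREATE TABLE prod_parquet (id INT, val STRING)
STORED AS PARQUET;

-- copy from the CSV-backed table into the Parquet table, casting as needed
INSERT INTO TABLE prod_parquet
SELECT CAST(id AS INT), val FROM staging_csv;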
I searched on Google and found this can be done via:
1) an incremental table: load the new file into an incremental table and do an insert statement. In my case we have more than 100 tables, so I don't want to create that many incremental tables.
2) using the REFRESH command via the Impala shell.
Our initial tables are stored in CSV SerDe format, so when I do a REFRESH on the initial tables I get an error that Impala doesn't support the SerDe properties.
Can you please provide a solution for my case?

External hive table as parquet file returns NULL when queried

I created a .parquet file by using a MapReduce job. Now I want to create an external table on top of this file. Here is the command:
CREATE EXTERNAL TABLE testparquet (
NAME STRING,
AGE INT
)
STORED AS PARQUET
LOCATION 'file location'
The table is created successfully, but when I query the table using a simple SELECT *, I get NULL for all fields. The version of Hive is 0.13.
Is there anything that I am missing?
When using external files, you need to explicitly synchronize the metadata store, which knows about the schema of your data, with the actual data itself.
Typically, you'll use the INVALIDATE METADATA command to force subsequent queries to re-read the data. You can also use REFRESH <table-name> if just one table has been updated.
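For example, from the Impala shell (using the table name from the question):
INVALIDATE METADATA testparquet;
-- or, if only this one table has new data:
REFRESH testparquet;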

Load from HIVE table into HDFS as AVRO file

I want to load a file into HDFS (as an .avro file) from a Hive table.
Currently I am able to move a table as a file from Hive to HDFS, but I am not able to specify a particular format for my target file. Can someone help me with this?
So your question is really:
How do I convert a Hive table to a different storage format?
Create a new table with the same fields and types as the old table, but with the storage format set to Avro. Then insert into the new table from the old table.
INSERT OVERWRITE TABLE newtable SELECT * FROM oldtable
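A minimal sketch of that approach, assuming Hive 0.14+ (which supports STORED AS AVRO directly) and made-up table and column names:
-- new table with the same columns as the old one, but stored as Avro
CREATE TABLE newtable (
  id INT,
  name STRING
)
STORED AS AVRO;

-- copy the data across; the Avro files land in the new table's HDFS location
INSERT OVERWRITE TABLE newtable SELECT * FROM oldtable;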

How to create table dynamically based on the uploaded csv file column header using oracle apex

Based on the CSV file's column header it should create a table dynamically, and it should also insert the records of that CSV file into the newly created table.
Ex:
1) If I upload a file TEST.csv with 3 columns, it should create a table dynamically with three columns.
2) Again, if I upload a new file called TEST2.csv with 5 columns, it should create a table dynamically with five columns.
Every time it should create a table based on the uploaded CSV file's header.
How do I achieve this in Oracle APEX?
Thanks in advance.
Without creating new tables, you can treat the CSVs as tables using a table function you can SELECT from. If you download the packages from the Alexandria Project you will find a function that does just that inside CSV_UTIL_PKG (clob_to_csv is the function, but you will find other goodies in there).
You would just upload the CSV, store it in a CLOB column, and then you can build reports on it using the CSV_UTIL_PKG code.
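For instance, a rough sketch of that pattern, assuming the Alexandria CSV_UTIL_PKG is installed and the upload sits in a CLOB column (the table, column and page-item names below are hypothetical):
-- uploads / file_clob / :P1_UPLOAD_ID are placeholder names
SELECT csv.*
FROM uploads u,
     TABLE(csv_util_pkg.clob_to_csv(u.file_clob)) csv
WHERE u.id = :P1_UPLOAD_ID;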
If you must create a new table for the upload, you could still use this parser. Upload the file and then select just the first row (e.g. SELECT * FROM csv_util_pkg.clob_to_csv(your_clob) WHERE ROWNUM = 1). You could insert this row into an APEX collection using APEX_COLLECTION.CREATE_COLLECTION_FROM_QUERY to make it easy to iterate over each column.
You would need to determine the datatype for each column, but you could just use VARCHAR2 for everything.
If you are just using generic columns anyway, you could just as easily store one additional column holding a name for this collection of records and keep all of the uploads in the same table. Just build another table to store the column names.
Simply store the file as a BLOB if the structure is "dynamic".
You can use the XML data type for this use case too, but it won't be very different from a BLOB column.
There is also the SecureFiles feature, available since 11g. It is a new BLOB implementation that performs better than regular BLOBs and is a good fit for unstructured or semi-structured data.
