How to change data type for column on partitioned external Hive table (parquet) without deleting data? - hadoop

I have a partitioned external Hive table. It has data loaded from parquet files. I have a few columns in that table that require a datatype change (TIMESTAMP -> STRING). Currently, when you query these columns, it returns NULL values because of the wrong data-type.
I ran ALTER TABLE table_name CHANGE col_1 col_1 STRING; to successfully change the datatype for that column to STRING, but when I query the table again, the data in that table is still showing NULL. Is there a way to update the data without dropping the partitions and re-loading the data from scratch?

Related

How to delete fields from a partitioned table in Hive stored as parquet?

I'm looking for a way to modify a parquet data table in Hive to remove some fields. The table is managed, but that doesn't matter because I can convert it to external.
The problem is that I cannot use the command ALTER TABLE ... REPLACE COLUMNS with partitioned parquet tables.
It works well for the textfile format (partitioned or not), but for parquet it works only on non-partitioned tables.
I've tried to replace column but this is the result:
hive> ALTER TABLE db_test.mytable REPLACE COLUMNS(name String);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Replacing columns cannot drop columns for table db_test.mytable.
SerDe may be incompatible
I've thought about some solutions, but none of them fits my scenario:
First
- [Optional] Convert the table to external.
- Delete the table.
- Re-create the table with the fields that I want.
- MSCK REPAIR TABLE to add HDFS partitions.
- [Optional] Convert back to managed table.
Second
- Create a temporary table as a selection from the original table with the fields that I choose.
- Delete the original table.
- Rename the temporary table to the original name.
Both options affect my process because I would lose the statistics of my table. This table is consumed by MicroStrategy through Impala and I need to maintain the statistics.
In addition, the second solution performs badly with very large tables.
Any suggestions?
Thanks in advance.
You can use the first method and then run
hive> ANALYZE TABLE <db_name>.<table_name> COMPUTE STATISTICS;
to recompute the statistics of the table.
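The first method plus the statistics refresh might look roughly like the sketch below. Only db_test.mytable and the name column come from the question; the partition column dt and the LOCATION path are hypothetical placeholders. Since the table is read through Impala, the Impala-side statistics would probably also need recomputing with Impala's own COMPUTE STATS statement.

```sql
-- 1. Make the table external so DROP TABLE leaves the data files in place.
ALTER TABLE db_test.mytable SET TBLPROPERTIES ('EXTERNAL'='TRUE');

-- 2. Drop only the metadata.
DROP TABLE db_test.mytable;

-- 3. Re-create the table with just the wanted columns.
--    dt and the LOCATION path are hypothetical; use the real partition
--    column(s) and the directory that already holds the data.
CREATE EXTERNAL TABLE db_test.mytable (name STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION '/path/to/mytable';

-- 4. Register the partition directories that already exist on HDFS.
MSCK REPAIR TABLE db_test.mytable;

-- 5. Recompute statistics for every partition.
ANALYZE TABLE db_test.mytable PARTITION (dt) COMPUTE STATISTICS;

-- Optional: convert back to a managed table.
-- ALTER TABLE db_test.mytable SET TBLPROPERTIES ('EXTERNAL'='FALSE');
```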

HIVE partitioned by column becomes all 0 after inserting data from another table

I am using Hortonworks to create a partitioned table in Hive and insert data into it from another Hive table. The problem is that, after I inserted data into the table I created, all values in the partition column (passenger_count) of the resulting table show 0, even though none of the values in the original table are 0.
Below are the steps I have taken to create the partitioned table and insert data into it:
Run the following query to create table called 'date_partitioned':
create table date_partitioned
(tpep_dropoff_datetime string, trip_distance double)
partitioned by (passenger_count int);
Run the following query to insert data into 'date_partitioned' table, from another existing table:
INSERT INTO TABLE date_partitioned
PARTITION (passenger_count)
SELECT tpep_dropoff_datetime, trip_distance, passenger_count
FROM trips_raw;
The column types and sample values of the 'trips_raw' are shown in the screenshots below:
As you can see, the 'passenger_count' column is int type and contains non-zero values. But when I look at the results from the 'date_partitioned' table, the values from the 'passenger_count' column all show 0. The table also created a duplicate 'passenger_count' (so it has 2 'passenger_count' columns, one of which is empty). You can see from the screenshot below:
Any advice would be greatly appreciated. I am curious as to why 'passenger_count' shows 0 in the resulting table when the original column has no 0 values, and why there's an additional 'passenger_count' column in the resulting table.
Are you sure that passenger_count is 0 for all loaded rows? Can you do a COUNT and GROUP BY passenger_count on both tables? Maybe you're just sampling all zeroes?
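The check suggested above could be run as follows, using the table names from the question (the alias n is just illustrative):

```sql
-- Distribution of passenger_count in the source table.
SELECT passenger_count, COUNT(*) AS n
FROM trips_raw
GROUP BY passenger_count;

-- Same distribution in the partitioned table; if the insert worked,
-- the two result sets should match.
SELECT passenger_count, COUNT(*) AS n
FROM date_partitioned
GROUP BY passenger_count;
```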

Create partitioned table from non partitioned table

Suppose I have an internal, non-partitioned ORC table in Hive:
CREATE TABLE IF NOT EXISTS non_partitioned_table(
id STRING,
company STRING,
city STRING,
country STRING
)
STORED AS ORC;
Is it possible to somehow create a partitioned parquet table from it with a CREATE TABLE LIKE style statement, like this?
create partitioned_table PARTITION ON (date STRING) like non_partitioned_table;
alter table partitioned_table SET FILEFORMAT PARQUET;
This create statement doesn't work.
So basically I need to add a column and make the table partitioned by this column. I know that I can create the table with a plain CREATE TABLE statement, but I need to do it with CREATE TABLE LIKE and then alter the result somehow.
Your table doesn't have a date column to begin with, so you're going to have to make a new one.
You might be able to ALTER TABLE non_partitioned_table ADD PARTITION, but haven't tried that myself. If you want to try it, I would suggest the partition location be outside of the existing HDFS directory.
Anyway, the CREATE TABLE LIKE DDL does not support PARTITIONED BY:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
You need to copy the schema shown by DESCRIBE for the first table, then alter it to add the PARTITIONED BY clause and, optionally, STORED AS. (SET FILEFORMAT PARQUET changes only the table metadata; it doesn't convert the existing data files in place.)
Then, if you want the data in the new table, you need to INSERT OVERWRITE TABLE.
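A minimal sketch of that approach, assuming the column list from the question. The partition value '2020-01-01' is a made-up placeholder, because the source table has no date column to derive it from. The backticks around date are there because DATE is a reserved keyword in recent Hive versions.

```sql
-- New table: same columns as non_partitioned_table, plus a partition column.
CREATE TABLE IF NOT EXISTS partitioned_table (
  id STRING,
  company STRING,
  city STRING,
  country STRING
)
PARTITIONED BY (`date` STRING)
STORED AS PARQUET;

-- Copy the rows into a single static partition.
-- '2020-01-01' is a placeholder value, not something from the source data.
INSERT OVERWRITE TABLE partitioned_table PARTITION (`date` = '2020-01-01')
SELECT id, company, city, country
FROM non_partitioned_table;
```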

How to query sorted/indexed columns in Impala

I have to build a POC with Hadoop for a database that serves interactive queries (a log database of about 300 TB). I'm trying Impala, but I haven't found any way to use sorted or indexed data. I'm a newbie, so I don't even know if it is possible.
How can I query sorted/indexed columns in Impala?
By the way, here is my table's code (simplified).
I would like to have fast access on the "column_to_sort" column below.
CREATE TABLE IF NOT EXISTS myTable (
unique_id STRING,
column_to_sort INT,
content STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073'
STORED AS textfile;

Create new hive table from existing external partitioned table

I have an external partitioned table with almost 500 partitions. I am trying to create another external table with the same properties as the old table. Then I want to copy all the partitions from my old table to the newly created table. Below is my create table query. My old table is stored as TEXTFILE and I want to save the new one as an ORC file.
add jar json_jarfile;
CREATE EXTERNAL TABLE new_table_orc (col1,col2,col3...col27)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (....)
STORED AS orc
LOCATION 'path';
After creating this table, I use the query below to insert the partitions from the old table into the new one. I only want to copy a few columns from the original table to the new table.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE new_table_orc PARTITION (year,month,day) SELECT col2,col3,col6,year,month,day FROM old_table;
ALTER TABLE new_table_orc RECOVER PARTITIONS;
I am getting the error below.
FAILED: SemanticException [Error 10044]: Line 2:23 Cannot insert into target table because column number/types are different 'day': Table insclause-0 has 27 columns, but query has 6 columns.
Any suggestions?
Your query has to match the number and types of the columns in your new table. You created the new table with 27 regular columns and 3 partition columns, but your query selects only six columns.
If you really only care about those six columns, then modify the new table to have only those columns. If you do want all columns, then modify your select statement to select all of them.
You also will not need the "recover partitions" statement. When you insert into a table with dynamic partitions, Hive creates those partitions both in the filesystem and in the metastore.
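The first option could be sketched as below: the new table declares only the three regular columns that the query selects, plus the three partition columns. The STRING column types are assumptions, since the question does not show the real types, and 'path' is kept as the question's own placeholder. The JSON SerDe clause is left out of this sketch on the assumption that STORED AS ORC should supply the ORC SerDe.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Only the columns that will actually be copied.
-- The STRING types are assumed; use the real types from old_table.
CREATE EXTERNAL TABLE new_table_orc (
  col2 STRING,
  col3 STRING,
  col6 STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS ORC
LOCATION 'path';

-- Regular columns first, partition columns last, in the same order
-- as in the PARTITION clause.
INSERT OVERWRITE TABLE new_table_orc PARTITION (year, month, day)
SELECT col2, col3, col6, year, month, day
FROM old_table;
```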
