Hive partition column becomes all 0 after inserting data from another table - hadoop

I am using Hortonworks to create a partitioned table in Hive and insert data into it from another Hive table. The problem is that after I insert data into the table I created, all values in the partition column (passenger_count) in the resulting table show 0, even though none of the values in the original table are 0.
Below are the steps I have taken to create the partitioned table and insert data into it:
Run the following query to create a table called 'date_partitioned':
create table date_partitioned
(tpep_dropoff_datetime string, trip_distance double)
partitioned by (passenger_count int);
Run the following query to insert data into 'date_partitioned' table, from another existing table:
INSERT INTO TABLE date_partitioned
PARTITION (passenger_count)
SELECT tpep_dropoff_datetime, trip_distance, passenger_count
FROM trips_raw;
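As an aside to these steps: a dynamic partition insert like the one above generally needs dynamic partitioning enabled first, and an all-dynamic partition spec needs nonstrict mode. A minimal sketch of the settings, assuming a stock Hive configuration:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- the dynamic partition column (passenger_count) must come last in the SELECT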
The column types and sample values of the 'trips_raw' table are shown in the screenshots below:
As you can see, the 'passenger_count' column is of int type and contains non-zero values. But when I look at the results from the 'date_partitioned' table, the values in the 'passenger_count' column all show 0. The table also has a duplicate 'passenger_count' column (so it has 2 'passenger_count' columns, one of which is empty), as you can see from the screenshot below:
Any advice would be greatly appreciated. I am curious as to why 'passenger_count' shows 0 in the resulting table when the original column has no 0s, and why there's an additional 'passenger_count' column in the resulting table.

Are you sure that all the loaded rows have passenger_count = 0? Can you do a COUNT with GROUP BY passenger_count on both tables? Maybe you're just sampling all zeroes?
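A minimal sketch of that diagnostic, using the table and column names from the question:
-- distribution of passenger_count in the source table
SELECT passenger_count, COUNT(*) FROM trips_raw GROUP BY passenger_count;
-- distribution of passenger_count in the partitioned table
SELECT passenger_count, COUNT(*) FROM date_partitioned GROUP BY passenger_count;
If the second query returns a single passenger_count of 0 covering every row, the problem is in the load; if not, the 0s were just a sampling artifact.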

Related

How to change data type for column on partitioned external Hive table (parquet) without deleting data?

I have a partitioned external Hive table. It has data loaded from parquet files. A few columns in that table require a data type change (TIMESTAMP -> STRING). Currently, when you query these columns, they return NULL values because of the wrong data type.
I ran ALTER TABLE table_name CHANGE col_1 col_1 STRING; to successfully change the data type for that column to STRING, but when I query the table again, the data is still showing NULL. Is there a way to update the data without dropping the partitions and re-loading the data from scratch?
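For reference: on a partitioned table, ALTER TABLE ... CHANGE by default updates only the table-level metadata, while every existing partition keeps its old column type, which can leave exactly this kind of lingering NULL. A hedged sketch of one possible fix, assuming Hive 1.1+ where the CASCADE clause is available:
-- CASCADE pushes the column change down to the metadata of all existing partitions
ALTER TABLE table_name CHANGE COLUMN col_1 col_1 STRING CASCADE;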

How to delete fields from a partitioned table in Hive stored as parquet?

I'm looking for a way to modify a parquet data table in Hive to remove some fields. The table is managed, but it doesn't matter because I can convert it to external.
The problem is that I cannot use the command ALTER TABLE ... REPLACE COLUMNS with partitioned parquet tables.
It works well for the textfile format (partitioned or not), but only for non-partitioned parquet tables.
I've tried to replace a column, but this is the result:
hive> ALTER TABLE db_test.mytable REPLACE COLUMNS(name String);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Replacing columns cannot drop columns for table db_test.mytable.
SerDe may be incompatible
I've thought about some solutions, but none of them fits my scenario:
First
- [Optional] Convert the table in external.
- Delete the table.
- Re-create the table with the fields that I want.
- MSCK REPAIR TABLE to add HDFS partitions.
- [Optional] Convert back to managed table.
Second
- Create a temporary table as a SELECT of the original table with the fields that I choose.
- Delete the original table.
- Rename the temporary table to the original name.
Both options affect my process because I would lose the statistics of my table. This table is consumed by MicroStrategy through Impala and I need to maintain the statistics.
In addition, the second solution is bad for very large data tables.
Any suggestions?
Thanks in advance.
You can use the first method and then run
hive> analyze table <db_name>.<table_name> compute statistics;
to compute all the statistics of the table.
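A minimal sketch of that first method end to end, assuming the table from the question and that its data files stay in place on HDFS (the partition column and location are hypothetical stand-ins):
-- make the table external so dropping it keeps the data files
ALTER TABLE db_test.mytable SET TBLPROPERTIES('EXTERNAL'='TRUE');
DROP TABLE db_test.mytable;
-- re-create the table with only the fields to keep (same partition spec as before)
CREATE EXTERNAL TABLE db_test.mytable (name string)
PARTITIONED BY (part_col string)          -- hypothetical partition column
STORED AS parquet
LOCATION '/path/to/existing/data';        -- hypothetical location
-- re-attach the existing HDFS partitions, then recompute statistics
MSCK REPAIR TABLE db_test.mytable;
ANALYZE TABLE db_test.mytable COMPUTE STATISTICS;  -- older Hive may require a PARTITION(...) spec here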

How to drop hive column?

I have two columns, Id and Name, in a Hive table, and I want to delete the Name column. I have used the following command:
ALTER TABLE TableName REPLACE COLUMNS(id string);
The result was that the Name column values were assigned to the Id column.
How can I drop a specific column of the table and is there any other command in Hive to achieve my goal?
In addition to the existing answers to the question: Alter hive table add or drop column
As per Hive documentation,
REPLACE COLUMNS removes all existing columns and adds the new set of columns.
REPLACE COLUMNS can also be used to drop columns. For example, ALTER TABLE test_change REPLACE COLUMNS (a int, b int); will remove column c from test_change's schema.
The query you are using is right, but it will modify only the schema, i.e. the metastore. It will not modify anything on the data side.
So, before you drop the column, you should make sure that you have the correct data file.
In your case the data file should not contain Name values.
If you don't want to modify the file, then create another table with only the specific columns you need:
CREATE TABLE tablename AS SELECT id FROM already_existing_table;
Let me know if this helps.
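If the original table name must be preserved, a common follow-on pattern is a staging CTAS plus a rename; a sketch, assuming the names from the question:
-- copy only the column to keep into a staging table
CREATE TABLE TableName_tmp AS SELECT id FROM TableName;
-- swap the staging table in under the original name
DROP TABLE TableName;
ALTER TABLE TableName_tmp RENAME TO TableName;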

hive first column to consider in partition table

When creating a partitioned table in Hive, is it mandatory to always choose the last column as the partition column?
If I choose the 1st column as the partition column, I can't filter the data. Is there any way to choose the first column for partitioning?
In Hive, if you want to partition a table, you have to define the partition column at table creation time, and while populating the data into the table you need to specify it as follows:
"INSERT INTO partitioned_table PARTITION(status) SELECT id, name, status FROM temp_tbl"
This way you can partition based only on the last column of the SELECT. If you want to partition on the basis of the first column, you have to write a MapReduce job for that; that is the only option available.
I guess the problem you are facing is that you already have a source table in your local system or HDFS and you want to upload it to a partitioned table, with the first column of the source data as the partition column in Hive. As the source data does not have headers, I guess we cannot do anything if we try to directly upload the file into the Hive destination folder. The only alternative way I know is: create a non-partitioned table in Hive whose structure exactly matches the source file, upload the source data into the non-partitioned table first, then copy the data from the non-partitioned table into the partitioned table.
Suppose the partitioned target table is like this:
create table source(eid int, ename string, esal int) partitioned by (dept string)
Your non-partitioned table, where you upload the data, is like this:
create table nopart(dept string, esal int, ename string, eid int)
Then you use dynamic partitioning with the command:
insert overwrite table source partition(dept) select eid, ename, esal, dept from nopart;
The order of the columns in the SELECT is the only point here: the partition column must come last.
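A small verification sketch to follow the insert above, assuming the same tables (note that an all-dynamic insert like this typically also needs set hive.exec.dynamic.partition.mode=nonstrict; beforehand):
-- list the partitions created by the dynamic insert
SHOW PARTITIONS source;
-- spot-check one partition (the dept value here is hypothetical)
SELECT * FROM source WHERE dept = 'sales' LIMIT 10;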

Create new Hive table from existing external partitioned table

I have an external partitioned table with almost 500 partitions. I am trying to create another external table with the same properties as the old table, and then I want to copy all the partitions from my old table to the newly created table. Below is my create table query. My old table is stored as TEXTFILE and I want to save the new one as an ORC file.
add jar json_jarfile;
CREATE EXTERNAL TABLE new_table_orc (col1, col2, col3 ... col27)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (....)
STORED AS orc
LOCATION 'path';
After creating this table, I am using the below query to insert the partitions from the old table into the new one. I only want to copy a few columns from the original table to the new table:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE new_table_orc PARTITION (year,month,day) SELECT col2,col3,col6,year,month,day FROM old_table;
ALTER TABLE new_table_orc RECOVER PARTITIONS;
I am getting the below error:
FAILED: SemanticException [Error 10044]: Line 2:23 Cannot insert into target table because column number/types are different 'day': Table insclause-0 has 27 columns, but query has 6 columns.
Any suggestions?
Your query has to match the number and type of columns in your new table. You have created your new table with 27 regular columns and 3 partition columns, but your query selects only six columns.
If you really only care about those six columns, then modify the new table to have only those columns. If you do want all columns, then modify your select statement to select all of those columns.
You also will not need the "recover partitions" statement. When you insert into a table with dynamic partitions, it will create those partitions both in the filesystem and in the metastore.
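A minimal sketch of the first option (trimming the new table to the selected columns); the column types here are hypothetical stand-ins, and note that with STORED AS orc the JSON SerDe from the original DDL is unnecessary, since ORC supplies its own SerDe:
-- new table keeps only the columns the INSERT actually selects
CREATE EXTERNAL TABLE new_table_orc (col2 string, col3 string, col6 string)  -- hypothetical types
PARTITIONED BY (year string, month string, day string)
STORED AS orc
LOCATION 'path';
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- 3 regular + 3 dynamic partition columns now match the 6 selected columns
INSERT OVERWRITE TABLE new_table_orc PARTITION (year, month, day)
SELECT col2, col3, col6, year, month, day FROM old_table;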
