Case sensitive column names in Hive

I am trying to create an external Hive table with partitions. Some of my
column names contain upper-case letters. This caused a problem while creating
tables, since the values of columns whose names contain upper-case letters were
returned as NULL. I then modified the ParquetSerDe to handle this via
SERDEPROPERTIES, and this was working for external tables (not partitioned). Now I am
trying to create an external table WITH partitions, and whenever I try to
access an upper-case column (e.g. FieldName) I get this error:
select FieldName from tablename;
FAILED: RuntimeException java.lang.RuntimeException: cannot find field
FieldName from
[org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@4f45884b,
org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@8f11f27,
org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@77e8eb0e,
org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@1dae4cd,
org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@623e336d]
Are there any suggestions you can think of? I cannot change the schema of the data source.
This is the command I use to create the table:
CREATE EXTERNAL TABLE tablename (fieldname string)
PARTITIONED BY (partition_name string)
ROW FORMAT SERDE 'path.ModifiedParquetSerDeLatest'
WITH SERDEPROPERTIES ("casesensitive"="FieldName")
STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat';
And then add partition:
ALTER TABLE tablename ADD PARTITION (partition_name='partitionvalue')
LOCATION '/path/to/data';
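One thing that may be worth checking (an assumption based on Hive keeping SerDe metadata per partition, not something stated in the question): table-level SERDEPROPERTIES are not always reflected on individual partitions, so the property may need to be set on the partition itself:
ALTER TABLE tablename PARTITION (partition_name='partitionvalue')
SET SERDEPROPERTIES ("casesensitive"="FieldName");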

This is an old question, but the partition column has to be case sensitive because of the filesystem paths under which partitions are stored:
the path "/columnname=value/" is always different from the path "/columnName=value/" on a unix filesystem.
So relying on case-insensitive column names should be considered bad practice in Hive.
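For illustration, these two paths (a hypothetical warehouse layout) are always distinct, so a partition registered with one casing will never be found under the other:
/warehouse/tablename/fieldname=value/
/warehouse/tablename/FieldName=value/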

Related

Rules to be followed before creating a Hive partitioned table

As part of my requirement, I have to create a new Hive table and insert data into it programmatically. To do that, I have the following DDL to create the Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS countData (
tableName String,
ssn String,
hiveCount String,
sapCount String,
countDifference String,
percentDifference String,
sap_UpdTms String,
hive_UpdTms String)
COMMENT 'This table contains record count of corresponding tables of all the source systems present on Hive & SAP'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '';
I can handle inserting data into a partition of a Hive table with an insert query from the program. In the DDL above, I haven't added a "PARTITIONED BY" column because I am not totally clear on the rules for partitioning a Hive table. A couple of rules I know are:
While inserting data with a query, the partition column should be the last one.
The PARTITIONED BY column must not be an existing column in the table.
Could anyone let me know if there are any other rules for partitioning a Hive table?
Also, in my case, we run the program twice a day to insert data into the table, and each run produces around 8k to 10k records. I am thinking of adding a PARTITIONED BY column for the current date (just "mm/dd/yyyy") and inserting it from the code.
Is there a better way to implement the partitioning for my requirement, if adding a date in String format is not recommended?
What you mentioned is fine, but I would recommend the yyyyMMdd format because it sorts better and is less ambiguous than seeing 03/05 and not knowing which part is the day and which is the month.
If you want to run it twice a day, and you care about the time the job runs, then use PARTITIONED BY (dt STRING, hour STRING).
Also, don't use STORED AS TEXTFILE. Use Parquet or ORC instead.
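A rough sketch of what that DDL could look like (not from the answer itself: it reuses the columns from the DDL above, and the location path is a placeholder):
CREATE EXTERNAL TABLE IF NOT EXISTS countData (
  tableName STRING,
  ssn STRING,
  hiveCount STRING,
  sapCount STRING,
  countDifference STRING,
  percentDifference STRING,
  sap_UpdTms STRING,
  hive_UpdTms STRING)
PARTITIONED BY (dt STRING, hour STRING)
STORED AS PARQUET
LOCATION '/path/to/countData';  -- hypothetical location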

Hive: Does hive support partitioning and bucketing while using external tables

When using the PARTITIONED BY or CLUSTERED BY keywords while creating Hive tables,
Hive creates separate files corresponding to each partition or bucket. But is this still valid for external tables? My understanding is that data files corresponding to external tables are not managed by Hive. So does Hive create additional files corresponding to each partition or bucket and move the corresponding data into these files?
Edit - adding details.
A few extracts from "Hadoop: The Definitive Guide", Chapter 17: Hive:
CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
When we load data into a partitioned table, the partition values are specified explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
At the filesystem level, partitions are simply nested subdirectories of the table directory.
After loading a few more files into the logs table, the directory structure might look like this:
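(The tree itself did not survive formatting; based on the book's example it looks roughly like this, with each partition a key=value subdirectory:)
/user/hive/warehouse/logs
├── dt=2001-01-01/
│   ├── country=GB/
│   │   ├── file1
│   │   └── file2
│   └── country=US/
│       └── file3
└── dt=2001-01-02/
    ├── country=GB/
    │   └── file4
    └── country=US/
        ├── file5
        └── file6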
The above table was obviously a managed table, so Hive had ownership of the data and created a directory structure for each partition, as in the tree above.
In the case of an external table:
CREATE EXTERNAL TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
followed by the same set of load operations:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
how will Hive handle these partitions? For an external table without partitions, Hive simply points to the data file and answers any query by parsing it. But when loading data into a partitioned external table, where are the partitions created?
Hopefully in the Hive warehouse? Can someone support or clarify this?
Suppose you are partitioning on date, as this is a common thing to do.
CREATE EXTERNAL TABLE mydatabase.mytable (
  var1 DOUBLE,
  var2 INT
)
PARTITIONED BY (date STRING)
LOCATION '/user/location/wanted/';
Then add all your partitions:
ALTER TABLE mytable ADD PARTITION (date = '2017-07-27');
ALTER TABLE mytable ADD PARTITION (date = '2017-07-28');
And so on and so forth.
Finally, you can add your data in the proper locations. You will have an external partitioned table.
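The "proper location" for each partition is then its key=value subdirectory under the table location (the file names here are hypothetical):
/user/location/wanted/date=2017-07-27/datafile1
/user/location/wanted/date=2017-07-28/datafile2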
There is an easy way to do this.
Create your External Hive table first.
CREATE EXTERNAL TABLE database.table (
  id INT,
  name STRING
)
PARTITIONED BY (country STRING)
LOCATION 'xxxx';
Next, you have to run an MSCK command (metastore consistency check):
MSCK REPAIR TABLE database.table;
This command will recover all partitions that are available in your path and update the metastore. Now, if you run your query against your table, data from all partitions will be retrieved.
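One caveat worth stating (standard MSCK behavior, not mentioned in the answer): it only discovers directories that follow Hive's key=value partition naming, so the layout under the table location must look like this ('xxxx' is the placeholder location from the DDL above; the file names are hypothetical):
xxxx/country=GB/datafile1
xxxx/country=US/datafile2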

hive: first column to consider in a partitioned table

When creating a partitioned table in Hive, is it mandatory to always choose the last column as the partition column?
If I choose the 1st column as the partition, I can't filter the data; is there any way to choose the first column for the partition?
In Hive, if you want to partition a table, you have to define the partition column first, at table creation time, and while populating data into the table you need to specify it as follows:
INSERT INTO partitioned_table PARTITION(status) SELECT id, name, status FROM temp_tbl;
In this way you can partition based only on the last column of the SELECT. If you want to partition on the basis of the first column, you have to write a MapReduce job for that; that is the only option available.
I guess the problem you are facing is that you already have a "source" table on your local system or HDFS, you want to upload it to a partitioned table, and you want the first column of the source to be the partition in Hive. As the source table does not have headers, I guess we cannot do anything here if we try to directly upload the file into the Hive destination folder. The only alternative I know is to create a non-partitioned table in Hive whose structure exactly matches the source file, upload the source data into the non-partitioned table first, and then copy the data from the non-partitioned table into the partitioned table.
Suppose the partitioned destination table is like this:
create table source(eid int, ename string, esal int) partitioned by (dept string);
Your non-partitioned table, where you upload the data, is like this:
create table nopart(dept string, esal int, ename string, eid int);
Then you use a dynamic-partition insert:
insert overwrite table source partition(dept) select eid, ename, esal, dept from nopart;
The order of the columns in the SELECT is the only point here.
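One caveat not mentioned in the answer, but standard for Hive: a dynamic-partition insert like the one above usually requires enabling dynamic partitioning first:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;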

hive: external partitioned table without location

Is it possible to create an external partitioned table without a location? I want to add all the locations later, together with the partitions.
I tried:
CREATE EXTERNAL TABLE IF NOT EXISTS a.b
(line STRING)
COMMENT 'abc'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE
PARTITIONED BY day;
but I got: ParseException: missing EOF at 'PARTITIONED' near 'TEXTFILE'
I don't think so, as noted in the ALTER TABLE SET LOCATION documentation.
But anyway, I think your query has some errors, and the correct script would be:
CREATE EXTERNAL TABLE IF NOT EXISTS a.b
(line STRING)
COMMENT 'abc'
PARTITIONED BY (day String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE
;
I think the issue is that you have not specified a data type for your partition column "day". And you can create a Hive external table without a location and use ALTER TABLE options later to set the location.
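A minimal sketch of that follow-up step, assuming the corrected DDL above (the partition value and path are placeholders):
ALTER TABLE a.b ADD PARTITION (day='2017-07-27') LOCATION '/path/to/that/day';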

How to query sorted/indexed columns in Impala

I have to build a POC with Hadoop for a database using interactive queries (a ~300 TB log database). I'm trying Impala, but I didn't find any way to use sorted or indexed data. I'm a newbie, so I don't even know if it is possible.
How do you query sorted/indexed columns in Impala?
By the way, here is my table's code (simplified).
I would like to have fast access on the "column_to_sort" column below.
CREATE TABLE IF NOT EXISTS myTable (
  unique_id STRING,
  column_to_sort INT,
  content STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073'
STORED AS TEXTFILE;
